Disentangling complex genomic signals to understand population structure of an exploited, estuarine-dependent flatfish.

Shannon J. O’Leary, Christopher M. Hollenbeck, Robert R. Vega, David S. Portnoy

Data sets and session info

All SNP-containing loci/all individuals grouped by estuary.

## /// GENIND OBJECT /////////
## 
##  // 316 individuals; 4,122 loci; 44,170 alleles; size: 61.4 Mb
## 
##  // Basic content
##    @tab:  316 x 44170 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 2-80)
##    @loc.fac: locus factor for the 44170 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 2-2)
##    @type:  codom
##    @call: write_genind(data = tidy)
## 
##  // Optional content
##    @pop: population of each individual (group size range: 2-62)
##    @strata: a data frame with 32 columns ( LIB_ID, WELL, SAMPLE_ID, OCEAN, REGION1, REGION, ... )

All SNP-containing loci/individuals from northern Gulf of Mexico grouped by sampled estuary.

## /// GENIND OBJECT /////////
## 
##  // 248 individuals; 4,122 loci; 44,170 alleles; size: 49.9 Mb
## 
##  // Basic content
##    @tab:  248 x 44170 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 2-80)
##    @loc.fac: locus factor for the 44170 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 2-2)
##    @type:  codom
##    @call: .local(x = x, i = i, j = j, pop = ..1, treatOther = ..2, quiet = ..3, 
##     drop = drop)
## 
##  // Optional content
##    @pop: population of each individual (group size range: 248-248)
##    @strata: a data frame with 32 columns ( LIB_ID, WELL, SAMPLE_ID, OCEAN, REGION1, REGION, ... )

All SNP-containing loci/individuals from US Atlantic grouped by sampled estuary.

## /// GENIND OBJECT /////////
## 
##  // 68 individuals; 4,122 loci; 44,170 alleles; size: 19.5 Mb
## 
##  // Basic content
##    @tab:  68 x 44170 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 2-80)
##    @loc.fac: locus factor for the 44170 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 2-2)
##    @type:  codom
##    @call: .local(x = x, i = i, j = j, pop = ..1, treatOther = ..2, quiet = ..3, 
##     drop = drop)
## 
##  // Optional content
##    @pop: population of each individual (group size range: 68-68)
##    @strata: a data frame with 32 columns ( LIB_ID, WELL, SAMPLE_ID, OCEAN, REGION1, REGION, ... )

1 Sampling design

Tissues (fin clips) were taken from southern flounder young of the year (YOY), juveniles, and adults in estuaries throughout the Gulf from San Antonio Bay, TX to Apalachicola Bay, FL and in the Atlantic from St. John’s River, FL to Winyah Bay, SC. Additional fin clips from adults were obtained offshore of Texas and Louisiana which were assigned to the nearest estuary where possible. Samples are grouped by estuary within ocean basins (GULF, ATL).

## OGR data source with driver: ESRI Shapefile 
## Source: "/home/soleary/FLOUNDER/SFL_PopGen/data/BASEMAPS", layer: "ne_10m_land"
## with 1 features
## It has 2 fields
## OGR data source with driver: ESRI Shapefile 
## Source: "/home/soleary/FLOUNDER/SFL_PopGen/data/BASEMAPS", layer: "ne_10m_admin_1_states_provinces_lines"
## with 10118 features
## It has 21 fields

Sample distribution of southern flounder young-of-the-year, juveniles, and adults sampled in estuaries throughout the Gulf of Mexico and Western Atlantic. Abbreviations listed in table below.

Sample distribution of southern flounder young-of-the-year, juveniles, and adults sampled in estuaries throughout the Gulf of Mexico and Western Atlantic. Abbreviations listed in table below.

Southern flounder are assumed to be approximately 1 year at 25 cm (red dashed line), 2 years at 40 cm (blue line). Males are generally smaller, females are considered to be mature at approx 40 cm.

Distribution of total length [mm] across all sample locations. Dashed lines indicates approximate size at one (red) and two (blue) years old.

Distribution of total length [mm] across all sample locations. Dashed lines indicates approximate size at one (red) and two (blue) years old.

Sample size per estuary.

ESTUARY Abbreviation n
ApalachicolaBay AP 44
BaratariaBay BB 39
CharlestonHarbor CH 18
EastMississippiSound EMS 5
GalvestonBay GB 2
MatagordaBay MAT 3
MobileBay MB 62
SabineLake SL 24
SanAntonioBay SA 23
SanteeRivers SR 4
StHelenSound SHS 4
StJohnsRiver SJR 20
Stono-NorthEdistoRivers ST 4
WestMississippiSound WMS 34
WinyahBay WB 18

Sample size per ocean basin.

OCEAN n
GULF 248
ATL 68

Sample size per estuary and age class. YOY defined as fished caught at 25cm or less, juveniles 25 - 40cm, and adults as > 40cm.

ESTUARY ADULT JUVENILE YOY NA
ApalachicolaBay 1 3 40 NA
BaratariaBay 7 32 NA NA
CharlestonHarbor 7 9 2 NA
EastMississippiSound 3 2 NA NA
GalvestonBay NA 2 NA NA
MatagordaBay NA NA 3 NA
MobileBay 48 11 NA 3
SabineLake 2 19 3 NA
SanAntonioBay 3 10 10 NA
SanteeRivers 2 2 NA NA
StHelenSound 2 2 NA NA
StJohnsRiver NA NA 20 NA
Stono-NorthEdistoRivers 1 NA 3 NA
WestMississippiSound 9 22 2 1
WinyahBay 7 8 3 NA

Year sampled per estuary for YOY (defined as < 25cm).

ESTUARY 2013 2014 2015 2016 NA
CharlestonHarbor 1 NA 1 NA NA
WinyahBay 3 NA NA NA NA
Stono-NorthEdistoRivers NA 3 NA NA NA
WestMississippiSound NA NA 2 NA NA
ApalachicolaBay NA NA NA 40 NA
MatagordaBay NA NA NA 3 NA
SabineLake NA NA NA 3 NA
SanAntonioBay NA NA NA 9 1
StJohnsRiver NA NA NA 20 NA

2 Genotyping

See Genotyping.Rmd for details/scripts for SNP calling, filtering, and haplotyping.

DNA was extracted using Mag-Bind Blood and Tissue DNA kits (Omega Bio-Tek). Double digest restriction-site associated DNA (ddRAD) libraries were constructed using a modified protocol (Portnoy et al. 2015Portnoy, D. S., J. B. Puritz, C. M. Hollenbeck, J. Gelsleichter, D. Chapman, and J. R. Gold. 2015. “Selection and sex-biased dispersal in a coastal shark: The influence of philopatry on adaptive variation.” Molecular Ecology 24 (23): 5877–85. https://doi.org/10.1111/mec.13441.) and sequenced on four separate lanes of an Illumina HiSeq 2500.

Raw sequences were demultiplexed using process_radtags (Catchen et al. 2013Catchen, Julian, Paul A. Hohenlohe, Susan Bassham, Angel Amores, and William A. Cresko. 2013. “Stacks: An analysis tool set for population genomics.” Molecular Ecology 22 (11): 3124–40. https://doi.org/10.1111/mec.12354.). Quality trimming, read mapping, and SNP calling were performed using the dDocent pipeline (Puritz, Hollenbeck, and Gold 2014Puritz, Jonathan B, Christopher M Hollenbeck, and John R Gold. 2014. “dDocent : a RADseq, variant-calling pipeline designed for population genomics of non-model organisms.” PeerJ 2: e431. https://doi.org/10.7717/peerj.431.) and a reduced-representation reference genome previously produced for southern flounder (approximately 2 – 5% of the genome).

Raw SNPs were rigorously filtered using VCFtools (Danecek et al. 2011Danecek, Petr, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert E. Handsaker, et al. 2011. “The variant call format and VCFtools.” Bioinformatics 27 (15): 2156–8. https://doi.org/10.1093/bioinformatics/btr330.) and custom scripts following (???). Final filtering thresholds were a sequence quality of 20, a minimum genotype call rate per locus of 90%, a minor allele count of 3, a minimum genotype depth of 3, and a mean minimum depth of 15. Individuals with > 85% missing data were removed. SNPs were further filtered based on allele balance, quality/depth ratio, mapping quality ratio of reference/alternate alleles, properly paired status, strand representation, and maximum depth. Finally, rad_haplotyper (Willis et al. 2017Willis, Stuart C., Christopher M. Hollenbeck, Jonathan B. Puritz, John R. Gold, and David S. Portnoy. 2017. “Haplotyping RAD loci: an efficient method to filter paralogs and account for physical linkage.” Molecular Ecology Resources, February. https://doi.org/10.1111/1755-0998.12647.) was used to collapse SNPs on the same contig into haplotypes producing a final data set consisting of SNP-containing loci (hereafter loci) for data analysis.

3 Fst-outlier analysis

Presence of \(F_{ST}\)-outlier loci was assessed using two methods, the FDIST method (Beaumont and Nichols 1996Beaumont, M. A., and R. A. Nichols. 1996. “Evaluating loci for use in the genetic analysis of population structure.” Proceedings of the Royal Society B: Biological Sciences 263 (1377): 1619–26. https://doi.org/10.1098/rspb.1996.0237.) as implemented in Arlequin (Excoffier and Lischer 2010Excoffier, Laurent, and Heidi E. L. Lischer. 2010. “Arlequin suite ver 3.5: A new series of programs to perform population genetics analyses under Linux and Windows.” Molecular Ecology Resources 10 (3): 564–67. https://doi.org/10.1111/j.1755-0998.2010.02847.x.), and a Bayesian modeling method implemented in BayeScan (Foll and Gaggiotti 2008Foll, Matthieu, and Oscar Gaggiotti. 2008. “A genome-scan method to identify selected loci appropriate for both dominant and codominant markers: A Bayesian perspective.” Genetics 180 (2): 977–93. https://doi.org/10.1534/genetics.108.092221.). For both methods, loci with significantly higher \(F_{ST}\) than expected under a neutral model are considered as loci putatively under directional selection. Given low background \(F_{ST}\) values typical of marine fish data sets (Knutsen et al. 2011Knutsen, H., E. M. Olsen, P. E. Jorde, S. H. Espeland, C. André, and N. C. Stenseth. 2011. “Are low but statistically significant levels of genetic differentiation in marine fishes ’biologically meaningful’? A case study of coastal Atlantic cod.” Molecular Ecology 20 (4): 768–83. https://doi.org/10.1111/j.1365-294X.2010.04979.x.), an assessment for loci putatively under balancing selection (\(F_{ST}\) significantly lower than expected) was not conducted.

Analysis in Arlequin was based on 20,000 coalescent simulations using a strict island model. To account for multiple testing, p-values were corrected assuming a false discovery rate (FDR) < 0.05. BayeScan runs consisted of 25 pilot runs of 5,000 iterations, followed by a total of 550,000 iterations (burn-in of 50,000 iterations, 10,000 samples with thinning interval of 50). A FDR of 0.05 was used as the threshold for outlier detection.

For both methods, \(F_{ST}\)-outlier analysis was run for all individuals grouped by basin, and separately for Gulf and Atlantic individuals grouped by estuary. The distribution of loci flagged during \(F_{ST}\)-outlier analysis loci across linkage groups (chromosomes) was assessed using a previously established linkage map.

3.1 All individuals (grouped by ocean basin)

Bayesian maximum likelihood (multinomial-Dirichlet model)

BayeScan uses differences in allele frequencies between populations to identify \(F_{ST}\)-outlier loci. The null model (i.e. distribution of \(F_{ST}\) for neutral loci) is generated based on the multinomial-Dirichlet model which assumes allele frequencies of sub-populations are correlated through a common migrant gene pool from which they differ to varying degrees; this difference is measured by a subpopulation-specific \(F_{ST}\) coefficient. Therefore, this model can incorporate ecologically realistic scenarios and is a robust method even when effective sizes and migration rates differ among subpopulations.

Selection is introduced by decomposing population-specific \(F_{ST}\) coefficients into population-specific component (beta), shared by all loci, and locus-specific component (alpha), shared by all poulations using logistic regresstion. If a locus-specific component is necessary to explain observed pattern of diversity (i.e. alpha significantly different from 0), departure from neutrality is inferred.

BayeScan calculates posterior probability for model including selection using bayes factors. Multiple testing is needed to incorporate identifying loci as under selection by chance. The posterior odds are calculated to determine outlier loci. Posterior odds are calculated as the ratio of posterior probabilities indicating how much more likely the model with selection is compared to netural model. Posterior probabilities can be used to directly control FDR (expected proportion of false positives amont outlier loci). The q-value of each locus indicates the minimum FDR at which this locus may become significant, e.g. q-value > 0.05 means that 5% of corresponding outlier markers are expected to be false positives (5% threshold more stringent than corresponding p-value in classic statistics).

Parameters used for BayeScan run

##  [1] "There are 4122 loci."                        
##  [2] ""                                            
##  [3] "There are 2 populations."                    
##  [4] ""                                            
##  [5] "Burn in: 50000"                              
##  [6] "Thining interval: 50"                        
##  [7] "Sample size: 10000"                          
##  [8] "Resulting total number of iterations: 550000"
##  [9] "Nb of pilot runs: 25"                        
## [10] "Length of each pilot run: 5000"              
## [11] ""

Use q-value to determine if a locus is a good candidate for locus being under the influence of selection.

Comparison of log10(qvalue) and Fst per locus.

Comparison of log10(qvalue) and Fst per locus.

A total of 30 loci have a q-value < 0.05.

FDIST method implemented in Arlequin

Coalescent simulations based on island model are used to generate the null distribution for neutral loci based on a given number of populations.

Fst-heterozygosity distribution for samples grouped by ocean basin. Red open circles indicate loci significantly outside the simulated null distribution.

Fst-heterozygosity distribution for samples grouped by ocean basin. Red open circles indicate loci significantly outside the simulated null distribution.

A total of 329 loci were flagged as outlier using a corrected p-value < 0.05.

Genome-wide distribution of Fst values; loci flagged as outlier by Arlequin are indicated in red.

Fst estimated for 4122 loci, 1387 loci were mapped on the linkage map.

3.2 Gulf individuals

Bayesian maximum likelihood (multinomial-Dirichlet model)

Parameters used for BayeScan run

##  [1] "There are 4107 loci."                        
##  [2] ""                                            
##  [3] "There are 5 populations."                    
##  [4] ""                                            
##  [5] "Burn in: 50000"                              
##  [6] "Thining interval: 50"                        
##  [7] "Sample size: 10000"                          
##  [8] "Resulting total number of iterations: 550000"
##  [9] "Nb of pilot runs: 25"                        
## [10] "Length of each pilot run: 5000"              
## [11] ""

Use q-value to determine if a locus is a good candidate for locus being under the influence of selection.

Comparison of log10(qvalue) and Fst per locus.

Comparison of log10(qvalue) and Fst per locus.

no significant outlier detected

FDIST method (Arlequin)

Fst-heterozygosity distribution.

Fst-heterozygosity distribution.

no significant outlier detected

3.3 Atlantic individuals

Bayesian maximum likelihood (multinomial-Dirichlet model)

no significant outlier detected

FDIST method (arlequin)

no significant outlier detected

4 Assessment of population structure and genetic diversity

Loci were subdivided into two data sets: outlier loci (consensus loci identified by both outlier detection methods as putatively under directional selection) and neutral loci (all remaining loci). Only samples from estuaries with ≥18 individuals were used in the analysis.

4.1 AMOVA (hierarchical F-statistics)

Hierarchical, locus-by-locus analysis of molecular variance (AMOVA), as implemented in Arlequin, was used to test homogeneity of each component both between oceans and among estuaries within ocean basins. Divergence in neutral loci within each ocean basin was explored further by single-level AMOVA. For each AMOVA, significance for each component of variation was assessed by permuting individuals between groups 10,000 times.


# AMOVA (estuaries within ocean basins) & pairwise Fst neutral
./scr/arlecore3522_64bit data/POPGEN/SFL.estuaries.genotypes_arlequin.arp scr/Fstatistics_neutral_all.ars


# AMOVA (estuaries within ocean basins) & pairwise Fst outlier
./scr/arlecore3522_64bit data/POPGEN/SFL.estuaries.genotypes_arlequin.arp scr/Fstatistics_outlier_all.ars


# single level AMOVA Gulf (neutral)
./scr/arlecore3522_64bit data/POPGEN/SFL.estuaries.genotypes_arlequin.arp scr/Fstatistics_neutral_Gulf.ars


# single level AMOVA Atlantic (neutral)
./scr/arlecore3522_64bit data/POPGEN/SFL.estuaries.genotypes_arlequin.arp scr/Fstatistics_neutral_Atl.ars

Copy results files (*xml) into /results/ directory and rename; delete results directory. Extract AMOVA and pairwise \(F_{ST}\) results from xml file and format into tab-delimted file.

Locus-by-locus AMOVA using neutral loci indicating variance partitioning and significance of each component.

Source of variation Percentage Variation Average F-statistic over all loci p-value
among oceans 4.13479 0.04135 <0.0001
among estuaries w/in oceans 0.15511 0.00162 0.00931
among individuals w/in estuaries 95.71010 0.04290 <0.0001

Locus-by-locus AMOVA using outlier loci indicating variance partitioning and significance of each component.

Source of variation Percentage Variation Average F-statistic over all loci p-value
among oceans 32.75179 0.32752 <0.0001
among estuaries w/in oceans 0.13319 0.00198 0.60198
among individuals w/in estuaries 67.11502 0.32885 <0.0001

Compare global \(F_{ST}\) (single level AMOVA results) for individuals within Gulf and Atlantic basins.

Single-level AMOVA (locus-by-locus) using neutral loci indicating variance partitioning and significance of each component for Gulf individuals.
Source of variation Percentage Variation Average F-statistic over all loci p-value
among estuaries 0.14665 0.00147 0.027
within estuaries 99.85335 NA NA
Single-level AMOVA (locus-by-locus) using neutral loci indicating variance partitioning and significance of each component for Atlantic individuals.
Source of variation Percentage Variation Average F-statistic over all loci p-value
among estuaries 0.26015 0.0026 0.098
within estuaries 99.73985 NA NA

4.2 Pairwise Fst

Pairwise estimates of \(F_{ST}\) for each component was generated in Arlequin as a post hoc test for homogeneity between estuaries. Significance was determined by permuting individuals between groups 10,000 times. P-values were corrected for multiple comparisons assuming an FDR of 0.05.

Comparison of pairwise Fst (above the diagonal) and level of significance (below the diagonal) between all pairs of estuaries in the Gulf and Atlantic using neutral loci only. * indicates significant Fst-values.

FST SA SL BB WMS MB AP SJR CH WB
SA - 0.002 0.00109 0.00141 0.0011 0.00169* 0.0466* 0.04237* 0.04437*
SL 0.03297 - 0.00102 0.00105 0.00042 0.00106 0.0452* 0.04114* 0.04279*
BB 0.21018 0.32442 - 0.00067 0.00076 0.00079 0.04627* 0.04213* 0.04391*
WMS 0.06445 0.37571 0.38046 - 0.00045 0.00056 0.04629* 0.04207* 0.04416*
MB 0.12533 0.90229 0.08762 0.66033 - 0.00062 0.04509* 0.04059* 0.04284*
AP 0.00406 0.28472 0.14801 0.56796 0.20097 - 0.04573* 0.0414* 0.04325*
SJR 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 - 0.00104 0.00114
CH 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.44362 - 0.00114
WB 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.604305 0.14692 -

Pairwise Fst, p-value, and adjusted p-value for all pairs of estuaries in the Gulf and Atlantic using neutral loci only.

pop1 pop2 Fst p_val p_adj
SL MB 0.00042 0.902290 0.9022900
WMS MB 0.00045 0.660330 0.6791966
SJR WB 0.00114 0.604305 0.6398524
WMS AP 0.00056 0.567960 0.6195927
SJR CH 0.00104 0.443620 0.4990725
SL WMS 0.00105 0.375710 0.4418245
BB WMS 0.00067 0.380460 0.4418245
SL BB 0.00102 0.324420 0.4027283
SL AP 0.00106 0.284720 0.3660686
SA BB 0.00109 0.210180 0.2802400
MB AP 0.00062 0.200970 0.2782662
BB AP 0.00079 0.148010 0.2131344
CH WB 0.00114 0.146920 0.2131344
SA MB 0.00110 0.125330 0.1961687
BB MB 0.00076 0.087620 0.1433782
SA WMS 0.00141 0.064450 0.1104857
SA SL 0.00200 0.032970 0.0593460
SA AP 0.00169 0.004060 0.0076926
SA SJR 0.04660 0.000000 0.0000000
SA CH 0.04237 0.000000 0.0000000
SA WB 0.04437 0.000000 0.0000000
SL SJR 0.04520 0.000000 0.0000000
SL CH 0.04114 0.000000 0.0000000
SL WB 0.04279 0.000000 0.0000000
BB SJR 0.04627 0.000000 0.0000000
BB CH 0.04213 0.000000 0.0000000
BB WB 0.04391 0.000000 0.0000000
WMS SJR 0.04629 0.000000 0.0000000
WMS CH 0.04207 0.000000 0.0000000
WMS WB 0.04416 0.000000 0.0000000
MB SJR 0.04509 0.000000 0.0000000
MB CH 0.04059 0.000000 0.0000000
MB WB 0.04284 0.000000 0.0000000
AP SJR 0.04573 0.000000 0.0000000
AP CH 0.04140 0.000000 0.0000000
AP WB 0.04325 0.000000 0.0000000

Comparison of pairwise Fst (above the diagonal) and level of significance (below the diagonal) between all pairs of estuaries in the Gulf and Atlantic using outlier loci only. * indicates significant Fst-values.

FST SA SL BB WMS MB AP SJR CH WB
SA - -0.00112 0.00192 0.00468 -0.00028 0.00215 0.33993* 0.3419* 0.30559*
SL 0.87476 - 0.00062 0.00502 0.00177 0.00145 0.34198* 0.3445* 0.30632*
BB 0.46273 0.68815 - 0.00088 0.00172 0.00268 0.34807* 0.34365* 0.31019*
WMS 0.15038 0.12741 0.59954 - 0.0019 0.00357 0.32652* 0.32446* 0.29069*
MB 0.77893 0.36264 0.26572 0.26195 - 0.00152 0.34004* 0.33477* 0.30326*
AP 0.39481 0.5244 0.19206 0.11633 0.28552 - 0.34818* 0.34395* 0.31226*
SJR 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 - -0.00775 0.00085
CH 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.99287 - -0.00037
WB 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.60271 0.62142 -

Pairwise Fst, p-value, and adjusted p-value for all pairs of estuaries in the Gulf and Atlantic using outlier loci only.

pop1 pop2 Fst p_val p_adj
SJR CH -0.00775 0.99287 0.9928700
SA SL -0.00112 0.87476 0.8997531
SA MB -0.00028 0.77893 0.8247494
SL BB 0.00062 0.68815 0.7507091
BB WMS 0.00088 0.59954 0.6990975
SJR WB 0.00085 0.60271 0.6990975
CH WB -0.00037 0.62142 0.6990975
SL AP 0.00145 0.52440 0.6509793
SA BB 0.00192 0.46273 0.5949386
SA AP 0.00215 0.39481 0.5264133
SL MB 0.00177 0.36264 0.5021169
MB AP 0.00152 0.28552 0.4111488
BB MB 0.00172 0.26572 0.3985800
WMS MB 0.00190 0.26195 0.3985800
BB AP 0.00268 0.19206 0.3142800
SA WMS 0.00468 0.15038 0.2577943
SL WMS 0.00502 0.12741 0.2293380
WMS AP 0.00357 0.11633 0.2204147
SA SJR 0.33993 0.00000 0.0000000
SA CH 0.34190 0.00000 0.0000000
SA WB 0.30559 0.00000 0.0000000
SL SJR 0.34198 0.00000 0.0000000
SL CH 0.34450 0.00000 0.0000000
SL WB 0.30632 0.00000 0.0000000
BB SJR 0.34807 0.00000 0.0000000
BB CH 0.34365 0.00000 0.0000000
BB WB 0.31019 0.00000 0.0000000
WMS SJR 0.32652 0.00000 0.0000000
WMS CH 0.32446 0.00000 0.0000000
WMS WB 0.29069 0.00000 0.0000000
MB SJR 0.34004 0.00000 0.0000000
MB CH 0.33477 0.00000 0.0000000
MB WB 0.30326 0.00000 0.0000000
AP SJR 0.34818 0.00000 0.0000000
AP CH 0.34395 0.00000 0.0000000
AP WB 0.31226 0.00000 0.0000000

5 Comparison of genetic diversity among estuaries

Genomic diversity was estimated for each estuary (N > 18), using Nei’s gene diversity (Nei 1973Nei, Masatoshi. 1973. “Analysis of Gene Diversity in Subdivided Populations.” Proceedings of the National Academy of Sciences 70 (12): 3321–3. https://doi.org/10.1073/pnas.70.12.3321.), rarefied allele counts, and evenness. The latter is a measure of the distribution of allele frequencies and was estimated as the ratio of the Stoddart & Taylor index (diversity weighted for more abundant alleles) and the Shannon-Wiener index (diversity weighted for rarer alleles), as implemented in poppr (Kamvar, Tabima, and Grünwald 2014Kamvar, Zhian N., Javier F. Tabima, and Niklaus J. Grünwald. 2014. “Poppr : an R package for genetic analysis of populations with clonal, partially clonal, and/or sexual reproduction.” PeerJ 2 (March): e281. https://doi.org/10.7717/peerj.281.). For each measure of diversity, a Friedman’s rank sum test was used to test homogeneity among estuaries. A Wilcoxon signed-rank test was used post hoc to test for pairwise differences between estuaries; p-values were corrected for multiple comparisons, using a sequential Bonferroni approach (Rice 1989). The number of fixed loci was documented for each region and estuary, using rarefied allele counts.

5.1 Nei’s gene diversity (expected heterozygosity)

Compare patterns of gene diversity across loci within estuaries.

Distribution of Nei’s gene diversity (Hs) per locus among individuals grouped by estuary.

Distribution of Nei's gene diversity (Hs) per locus among individuals grouped by estuary.

Test for significant differences among estuaries using Friedman’s test.

# remove loci with Na values
rm <- read_delim("results/gendiv.locstats", delim = "\t") %>%
  select(GRP, LOCUS, Hs) %>%
  filter(GRP %in% levels_estuaries3) %>%
  mutate(GRP = ordered(GRP, levels = levels_estuaries3)) %>%
  filter(is.na(Hs))

temp <- het %>%
  filter(!LOCUS %in% rm$LOCUS)

friedman.test(Hs ~ GRP | LOCUS, data = temp)
## 
##  Friedman rank sum test
## 
## data:  Hs and GRP and LOCUS
## Friedman chi-squared = 158.21, df = 8, p-value < 0.00000000000000022

Test for significant pairwise differences between estuaries using Wilcoxon signed rank test; tests symmetry of numeric repeated measurements (statistic per locus) in block design.

# remove loci with Na values
rm <- het %>%
  filter(is.na(Hs))

het <- het %>%
  filter(!LOCUS %in% rm$LOCUS)

# groups to compare
comp <- as.character(unique(het$GRP))

# pairs of comparisons
# pairs <- combn(comp, 2, simplify = FALSE)

pairs <- expand.grid(comp, comp) %>%
  filter(!Var1 == Var2) %>%
  rownames_to_column("PAIR") %>%
  split(.$PAIR) %>%
  purrr::map(function(x){
    x %>%
      select(-PAIR) %>%
      gather(key = temp, value = GRP, 1:2) %>%
      select(-temp)
  })

# empty data frame for results
results <- setNames(data.frame(matrix(ncol = 4, nrow = 0)), 
                    c("pop1", "pop2", "stat", "p.value")) %>%
  mutate(pop1 = as.character(pop1),
         pop2 = as.character(pop2),
         stat = as.numeric(stat),
         p.value = as.numeric(p.value))

n <- as.numeric(length(pairs))

# loop over pairs
for(p in 1:length(pairs)){

  pair <- pairs[[p]]$GRP

  temp <- het %>%
    filter(GRP %in% pair) %>%
    mutate(GRP = ordered(GRP, levels = pair),
         LOCUS = as.factor(LOCUS)) %>%
  droplevels()

  wilcox <- wilcoxsign_test(Hs ~ GRP | LOCUS,
                data = temp,
                zero.method = "Pratt")

  df <- data.frame("pop1" = pair[1],
                   "pop2" = pair[2],
                   "stat" = as.numeric(wilcox@statistic@teststatistic),
                   "p-value" = as.numeric(pvalue(wilcox)))

  results <- bind_rows(results, df)

}

results <- results %>%
  mutate(p_adj = p.adjust(p.value, "BH")) %>%
  mutate(pop1 = ordered(pop1, levels = levels_estuaries3),
         pop2 = ordered(pop2, levels = levels_estuaries3)) %>%
  mutate(OCEAN1 = ifelse(pop1 %in% c("WinyahBay", "CharlestonHarbor", "StJohnsRiver"), "ATL", "GULF"),
         OCEAN2 = ifelse(pop2 %in% c("WinyahBay", "CharlestonHarbor", "StJohnsRiver"), "ATL", "GULF"),
         p.value = round(p.value, digits = 5),
         p_adj = round(p_adj, digits = 5)) %>%
  unite(COMP, OCEAN1, OCEAN2, sep = "-")

Compare significance of pairwise differences.

Level of significance of pairwise comparisons of distribution of Nei’s gene diversity across loci for individuals samples in a given estuary.

Level of significance of pairwise comparisons of distribution of Nei's gene diversity across loci for individuals samples in a given estuary.

Test statistic, p-value and adjusted p-value of pairwise comparisons of Nei’s gene diversity.

pop1 pop2 stat p.value p_adj COMP
WinyahBay StJohnsRiver 0.00 0.99938 0.99938 ATL-ATL
ApalachicolaBay MobileBay 0.07 0.94052 0.96739 GULF-GULF
SanAntonioBay ApalachicolaBay 0.26 0.79390 0.84060 GULF-GULF
ApalachicolaBay BaratariaBay 0.57 0.56670 0.61822 GULF-GULF
MobileBay BaratariaBay 0.66 0.50644 0.56974 GULF-GULF
BaratariaBay WestMississippiSound 0.88 0.37688 0.43767 GULF-GULF
SanAntonioBay BaratariaBay 0.99 0.31997 0.38941 GULF-GULF
SanAntonioBay MobileBay 0.99 0.32450 0.38941 GULF-GULF
CharlestonHarbor WinyahBay 1.11 0.26723 0.34358 ATL-ATL
ApalachicolaBay WestMississippiSound 1.46 0.14337 0.19116 GULF-GULF
MobileBay WestMississippiSound 1.55 0.12196 0.16886 GULF-GULF
CharlestonHarbor StJohnsRiver 1.73 0.08397 0.12092 ATL-ATL
SanAntonioBay WestMississippiSound 2.02 0.04308 0.06462 GULF-GULF
WestMississippiSound CharlestonHarbor 2.73 0.00640 0.01002 GULF-ATL
WestMississippiSound StJohnsRiver 2.84 0.00455 0.00745 GULF-ATL
SabineLake SanAntonioBay 2.92 0.00352 0.00603 GULF-GULF
BaratariaBay CharlestonHarbor 2.98 0.00291 0.00524 GULF-ATL
MobileBay CharlestonHarbor 3.48 0.00050 0.00095 GULF-ATL
ApalachicolaBay CharlestonHarbor 3.48 0.00050 0.00095 GULF-ATL
SanAntonioBay CharlestonHarbor 3.61 0.00030 0.00064 GULF-ATL
BaratariaBay StJohnsRiver 3.80 0.00015 0.00033 GULF-ATL
SabineLake ApalachicolaBay 3.81 0.00014 0.00033 GULF-GULF
WestMississippiSound WinyahBay 3.95 0.00008 0.00020 GULF-ATL
SabineLake MobileBay 3.98 0.00007 0.00019 GULF-GULF
ApalachicolaBay StJohnsRiver 4.12 0.00004 0.00012 GULF-ATL
MobileBay StJohnsRiver 4.17 0.00003 0.00010 GULF-ATL
BaratariaBay WinyahBay 4.28 0.00002 0.00007 GULF-ATL
SabineLake BaratariaBay 4.34 0.00001 0.00006 GULF-GULF
SanAntonioBay StJohnsRiver 4.35 0.00001 0.00006 GULF-ATL
SanAntonioBay WinyahBay 4.94 0.00000 0.00000 GULF-ATL
ApalachicolaBay WinyahBay 5.11 0.00000 0.00000 GULF-ATL
MobileBay WinyahBay 5.19 0.00000 0.00000 GULF-ATL
SabineLake WestMississippiSound 5.24 0.00000 0.00000 GULF-GULF
SabineLake CharlestonHarbor 5.68 0.00000 0.00000 GULF-ATL
SabineLake StJohnsRiver 5.93 0.00000 0.00000 GULF-ATL
SabineLake WinyahBay 6.35 0.00000 0.00000 GULF-ATL

5.2 Rarefied allele counts

# overall
setPop(gen) <- ~OVERALL

dat <- hierfstat:::.genind2hierfstat(gen)

ar <- allelic.richness(dat,
                       diploid = TRUE)

ar <- as.data.frame(ar$Ar) %>%
  rownames_to_column("LOCUS") %>%
  rename(ALL = `1`)

# by ocean
setPop(gen) <- ~OCEAN

dat <- hierfstat:::.genind2hierfstat(gen)

df <- allelic.richness(dat,
                       diploid = TRUE)

df <- as.data.frame(df$Ar) %>%
  rownames_to_column("LOCUS") %>%
  rename(GULF = `1`,
         ATL = `2`)

ar <- left_join(ar, df)

# by region
setPop(gen) <- ~REGION

dat <- hierfstat:::.genind2hierfstat(gen)

df <- allelic.richness(dat,
                       diploid = TRUE)

df <- as.data.frame(df$Ar) %>%
  rownames_to_column("LOCUS") %>%
  rename(CGULF = `1`,
         EGULF = `2`,
         SWATL = `3`,
         MATL = `4`,
         WGULF = `5`)

ar <- left_join(ar, df)

# by estuary
setPop(gen) <- ~ESTUARY

gen_est <- seppop(gen)

temp <- gen_est[c("SanAntonioBay", "SabineLake",
                  "BaratariaBay", "WestMississippiSound", "MobileBay",
                  "ApalachicolaBay",
                  "StJohnsRiver", "CharlestonHarbor", "WinyahBay")]

temp <- repool(temp)

setPop(temp) <- ~ESTUARY

dat <- hierfstat:::.genind2hierfstat(temp)

df <- allelic.richness(dat,
                       diploid = TRUE)

df <- as.data.frame(df$Ar) %>%
  rownames_to_column("LOCUS") %>%
  rename(SanAntonioBay = `1`,
         SabineLake = `2`,
         BaratariaBay = `3`,
         WestMississippiSound = `4`,
         MobileBay = `5`,
         ApalachicolaBay = `6`,
         StJohnsRiver = `7`,
         CharlestonHarbor = `8`,
         WinyahBay = `9`)

ar <- left_join(ar, df)

write_delim(ar, "results/rarefied.allelecount", delim = "\t")

Calculate allelic richness corrected for sample size using rarefaction.

Distribution of rarefied allele counts per estuary.

Distribution of rarefied allele counts per estuary.

Test for significant differences among estuaries using Friedman’s test.

## 
##  Friedman rank sum test
## 
## data:  AR and pop and LOCUS
## Friedman chi-squared = 1536.5, df = 8, p-value < 0.00000000000000022

Test for significant pairwise differences between estuaries using Wilcoxon signed rank test to identify patterns of significant differentiation.

Level of significance of pairwise comparisons of distribution of rarefied allelecounts across loci for individuals samples in a given estuary.

Level of significance of pairwise comparisons of distribution of rarefied allelecounts across loci for individuals samples in a given estuary.

Test statistic, p-value, and adjusted p-value of pairwise comparisons of rarefied allele count.

pop1 pop2 stat p.value p_adj COMP
ApalachicolaBay MobileBay 0.05 0.95652 0.95652 GULF-GULF
StJohnsRiver WinyahBay 0.27 0.78655 0.80903 ATL-ATL
SanAntonioBay WestMississippiSound 0.67 0.49982 0.52922 GULF-GULF
BaratariaBay SanAntonioBay 0.81 0.41988 0.45805 GULF-GULF
MobileBay BaratariaBay 0.88 0.37780 0.42502 GULF-GULF
ApalachicolaBay BaratariaBay 1.26 0.20901 0.24272 GULF-GULF
MobileBay SanAntonioBay 1.92 0.05451 0.06541 GULF-GULF
ApalachicolaBay SanAntonioBay 1.97 0.04914 0.06100 GULF-GULF
BaratariaBay WestMississippiSound 2.07 0.03887 0.04998 GULF-GULF
ApalachicolaBay WestMississippiSound 2.85 0.00436 0.00581 GULF-GULF
CharlestonHarbor WinyahBay 3.03 0.00245 0.00339 ATL-ATL
CharlestonHarbor StJohnsRiver 3.09 0.00197 0.00284 ATL-ATL
MobileBay WestMississippiSound 3.21 0.00133 0.00199 GULF-GULF
SabineLake ApalachicolaBay 3.85 0.00012 0.00018 GULF-GULF
SabineLake MobileBay 4.43 0.00001 0.00002 GULF-GULF
SabineLake BaratariaBay 4.92 0.00000 0.00000 GULF-GULF
SabineLake SanAntonioBay 5.10 0.00000 0.00000 GULF-GULF
SabineLake WestMississippiSound 6.27 0.00000 0.00000 GULF-GULF
WestMississippiSound CharlestonHarbor 17.79 0.00000 0.00000 GULF-ATL
SanAntonioBay CharlestonHarbor 17.85 0.00000 0.00000 GULF-ATL
BaratariaBay CharlestonHarbor 18.82 0.00000 0.00000 GULF-ATL
SanAntonioBay StJohnsRiver 18.92 0.00000 0.00000 GULF-ATL
WestMississippiSound StJohnsRiver 19.04 0.00000 0.00000 GULF-ATL
SanAntonioBay WinyahBay 19.90 0.00000 0.00000 GULF-ATL
ApalachicolaBay CharlestonHarbor 20.03 0.00000 0.00000 GULF-ATL
WestMississippiSound WinyahBay 20.08 0.00000 0.00000 GULF-ATL
MobileBay CharlestonHarbor 20.26 0.00000 0.00000 GULF-ATL
BaratariaBay StJohnsRiver 20.45 0.00000 0.00000 GULF-ATL
SabineLake CharlestonHarbor 20.81 0.00000 0.00000 GULF-ATL
ApalachicolaBay StJohnsRiver 21.34 0.00000 0.00000 GULF-ATL
BaratariaBay WinyahBay 21.40 0.00000 0.00000 GULF-ATL
MobileBay StJohnsRiver 21.79 0.00000 0.00000 GULF-ATL
SabineLake StJohnsRiver 22.17 0.00000 0.00000 GULF-ATL
ApalachicolaBay WinyahBay 22.33 0.00000 0.00000 GULF-ATL
MobileBay WinyahBay 22.75 0.00000 0.00000 GULF-ATL
SabineLake WinyahBay 22.82 0.00000 0.00000 GULF-ATL

5.3 Evenness

Determine the ratio of the number of abundant genotypes to the number of rarer genotypes calculated using the ratio of Stoddart & Tayolor index (diversity index weighted for more abundant alleles) & Shannon-Wiener index (diversity index weighted for more rare alleles).

Distribution of evenness per estuary.

Distribution of evenness per estuary.

Test for significant differences among estuaries using Friedman’s test.

## 
##  Friedman rank sum test
## 
## data:  EVENNESS and GRP and LOCUS
## Friedman chi-squared = 3842.3, df = 8, p-value < 0.00000000000000022

Test for significant pairwise differences between estuaries using Wilcoxon signed rank test to identify patterns of significant differences among estuaries.

Level of significance of pairwise comparisons of distribution of allele diversity measured as evenness across loci for individuals samples in a given estuary.

Level of significance of pairwise comparisons of distribution of allele diversity measured as evenness across loci for individuals samples in a given estuary.

Test statistic, p-value and adjusted p-value of pairwise comparisons of evenness.

pop1 pop2 stat p.value p_adj COMP
WinyahBay CharlestonHarbor 1.35 0.17690 0.17690 ATL-ATL
CharlestonHarbor StJohnsRiver 2.40 0.01653 0.01700 ATL-ATL
BaratariaBay ApalachicolaBay 3.38 0.00073 0.00078 GULF-GULF
WinyahBay StJohnsRiver 3.52 0.00042 0.00046 ATL-ATL
SanAntonioBay SabineLake 4.43 0.00001 0.00001 GULF-GULF
WestMississippiSound BaratariaBay 5.13 0.00000 0.00000 GULF-GULF
SabineLake WestMississippiSound 8.41 0.00000 0.00000 GULF-GULF
WestMississippiSound ApalachicolaBay 8.79 0.00000 0.00000 GULF-GULF
ApalachicolaBay MobileBay 9.53 0.00000 0.00000 GULF-GULF
BaratariaBay MobileBay 12.42 0.00000 0.00000 GULF-GULF
SanAntonioBay WestMississippiSound 12.52 0.00000 0.00000 GULF-GULF
SabineLake BaratariaBay 12.75 0.00000 0.00000 GULF-GULF
SabineLake ApalachicolaBay 15.66 0.00000 0.00000 GULF-GULF
StJohnsRiver SanAntonioBay 17.23 0.00000 0.00000 ATL-GULF
SanAntonioBay BaratariaBay 17.40 0.00000 0.00000 GULF-GULF
WestMississippiSound MobileBay 17.57 0.00000 0.00000 GULF-GULF
CharlestonHarbor SanAntonioBay 19.40 0.00000 0.00000 ATL-GULF
SanAntonioBay ApalachicolaBay 19.89 0.00000 0.00000 GULF-GULF
WinyahBay SanAntonioBay 19.99 0.00000 0.00000 ATL-GULF
StJohnsRiver SabineLake 20.15 0.00000 0.00000 ATL-GULF
CharlestonHarbor SabineLake 21.79 0.00000 0.00000 ATL-GULF
WinyahBay SabineLake 22.65 0.00000 0.00000 ATL-GULF
SabineLake MobileBay 23.81 0.00000 0.00000 GULF-GULF
StJohnsRiver WestMississippiSound 24.84 0.00000 0.00000 ATL-GULF
CharlestonHarbor WestMississippiSound 26.42 0.00000 0.00000 ATL-GULF
WinyahBay WestMississippiSound 26.85 0.00000 0.00000 ATL-GULF
SanAntonioBay MobileBay 26.88 0.00000 0.00000 GULF-GULF
StJohnsRiver BaratariaBay 27.20 0.00000 0.00000 ATL-GULF
StJohnsRiver ApalachicolaBay 28.52 0.00000 0.00000 ATL-GULF
CharlestonHarbor BaratariaBay 28.59 0.00000 0.00000 ATL-GULF
WinyahBay BaratariaBay 29.19 0.00000 0.00000 ATL-GULF
CharlestonHarbor ApalachicolaBay 30.11 0.00000 0.00000 ATL-GULF
WinyahBay ApalachicolaBay 30.54 0.00000 0.00000 ATL-GULF
StJohnsRiver MobileBay 31.82 0.00000 0.00000 ATL-GULF
CharlestonHarbor MobileBay 33.25 0.00000 0.00000 ATL-GULF
WinyahBay MobileBay 33.80 0.00000 0.00000 ATL-GULF

5.4 Fixed alleles

Identify number of fixed loci within sample locations.

Number of fixed alleles in the Atlantic compared to the Gulf.

GRP n
ATL 235
GULF 15

Number of fixed alleles per estuary with > 18 individuals based on rarefied allelic richness (estuaries sorted in descending order by number of fixed alleles).

GRP n
WinyahBay 513
StJohnsRiver 454
CharlestonHarbor 440
SanAntonioBay 304
SabineLake 250
WestMississippiSound 217
BaratariaBay 173
ApalachicolaBay 145
MobileBay 106

Compare fixed loci across sampled estuaries in the Gulf and Atlantic. The set size (horizontal green bars) indicates the total number of loci fixed in a given location, the intersection size (vertical orange bars) indicates the number of loci fixed only in a single location (single blue dot) or in two, three, or four locations (indicated by blue dots connected by line).

Compare global allelelic richness per locus for loci fixed in each ocean basin.

Distribution of global allele rarefied allele counts for loci fixed one of the ocean basins.

Distribution of global allele rarefied allele counts for loci fixed one of the ocean basins.

Compare global allelelic richness per locus for loci that are fixed each sample location.

Distribution of global rarefied allele counts across all individuals for loci fixed in each estuary.

Distribution of global rarefied allele counts across all individuals for loci fixed in each estuary.

In general, loci fixed in the Gulf are less variable across all individuals compared to loci fixed in the Atlantic with have higher allelic richness in the Gulf.

6 Tests of neutrality at the genome and locus level

Tajima’s D, Watterson’s estimator, and nucleotide diversity was calculated for each locus as implemented in pegas (Paradis 2010Paradis, Emmanuel. 2010. “Pegas: An R package for population genetics with an integrated-modular approach.” Oxford University Press. https://doi.org/10.1093/bioinformatics/btp696.).

Get sequence information for loci by reading in reduced representation reference. Create a tidy data set of variant call information (position, alternate alleles) from filtered VCF file used to create haplotypes. Create a data frame of haplotype sequences in the data set(s) using reference contig sequences, VCF information, and codes.out file from rad_haplotyper. Calculate haplotype-based measures, including haplotype diversity, nucleotide diversity, number of segregating sites, and Tajima’s D (test latter for significance).

# Read in the reference sequence
fa <- seqinr::read.fasta("data/REF/reference.fasta")

# create tibble with locus name and sequence
ref <- tibble::tibble(name = seqinr::getName(fa), seq = toupper(unlist(seqinr::getSequence(fa, as.string = TRUE))))

# Read the haplotype VCF file and put it in tidy format
vcf_tidy <- read.vcfR("data/VCF/temp/SFL.recode.vcf") %>%
  vcfR2tidy()

# Get the data frame with the positions
pos_tbl <- vcf_tidy$fix %>%
  filter(ALT %in% c("A", "T", "C", "G"))

# genind object to use
setPop(gen) <- ~ESTUARY

gen_obj <- gen

# Make a data frame with all of the haplotype sequences from the 'codes.out' file from rad_haplotyper
hap_index <- read_tsv("data/Haplotyping/codes.SFL.gen", col_names = FALSE) %>%
  filter(X1 %in% locNames(gen_obj)) %>%
  split(.$X1) %>%
  purrr::map(function(x) {
    codes <- str_split(x$X2, ",") %>%
      unlist()

    code_tbl <- tibble(locus = x$X1, code = codes) %>%
      separate(code, c("hap", "code"), sep = ":") %>%
      left_join(ref, by = c("locus" = "name"))

    pos <- pos_tbl %>%
      filter(CHROM == code_tbl$locus[1]) %>%
      pull(POS)

    for (i in 1:nrow(code_tbl)) {

      alleles <- str_split(code_tbl$hap[i], "") %>%
        unlist()

      replace <- tibble(pos = pos, allele = alleles)

      for (j in 1:nrow(replace)) {
        str_sub(code_tbl$seq[i], replace$pos[j], 1) <- replace$allele[j]
      }

    }

    code_tbl

  }) %>%
  bind_rows()


# set pops
setPop(gen_obj) <- ~ESTUARY

# create tidy data set
gen_tidy <- gen_obj %>%
  genind2df(oneColPerAll = TRUE) %>%
  rownames_to_column(var = "ind") %>%
  gather(allele, code, -pop, -ind) %>%
  separate(allele, c("locus", "allele"), sep = "\\.") %>%
  filter(pop %in% c("SanAntonioBay", "SabineLake",
                  "BaratariaBay", "WestMississippiSound", "MobileBay",
                  "ApalachicolaBay",
                  "StJohnsRiver", "CharlestonHarbor", "WinyahBay")) %>%
  droplevels()

# Calculate haplotype related stats
hap_div_tbl <- locNames(gen_obj) %>%
  purrr::map(function(y) {

    # for each locus
    gen_tidy %>%
      filter(locus == y) %>%
      filter(!code == "NA") %>%
      left_join(hap_index, by = c("code", "locus")) %>%
      as.tibble() %>%
      arrange(factor(pop, levels = c("SanAntonioBay", "SabineLake",
                                     "BaratariaBay", "WestMississippiSound", "MobileBay",
                                     "ApalachicolaBay",
                                     "StJohnsRiver", "CharlestonHarbor", "WinyahBay"))) %>%

      # for each population per locus
      split(.$pop) %>%
      purrr::map(function(x) {
        y <- do.call(rbind, strsplit(x$seq, ""))
        rownames(y) <- paste(x$ind, x$allele, sep = ".")

        # create DNAbin object for locus
        dna_bin <- as.DNAbin(y)                            

        # calculate Tajima's D and p-values
        taj_test <- tajima.test(dna_bin)

        # create table with results
        tibble(locus = x$locus[1],
               pop = x$pop[1],
               hap_div = hap.div(dna_bin),                     # calculate haplotype diversity
               nuc_div = nuc.div(dna_bin),                     # calculate nucleotide diversity (pi)
               seg_sites = length(seg.sites(dna_bin)),         # count number of segregating sites
               thetaS = theta.s(x = seg_sites, n = nLoc(gen)), # calculate theta S (watterson 1975)
               tajima_d = taj_test$D,                          # extract Tajima's D
               tajima_pval_norm = taj_test$Pval.normal,        # extract p-value Tajima's D (normal distribution)
               tajima_pval_beta = taj_test$Pval.beta)          # extract p-value Tajima's D (beta distribution)
      }) %>%
      bind_rows()
  }) %>%
    bind_rows()

write_delim(hap_div_tbl, "results/hap_div_est", delim = "\t")

Genome-wide, null distributions of Tajima’s D were simulated for each estuary, using a coalescent model as executed in MS (Hudson 2002Hudson, Richard R. 2002. “Generating samples under a Wright-Fisher neutral model of genetic variation.” Bioinformatics 18 (2): 337–38. https://doi.org/10.1093/bioinformatics/18.2.337.), to assess whether the observed genome-wide distribution of Tajima’s D values for each estuary was consistent with a neutral, drift-mutation equilibrium. For each simulation a set of neutral loci that consisted of the same number of loci with the same distribution of segregating sites as in the observed data (grouper by estuary) was generated. The difference between empirical and simulated distributions in each estuary was then assessed by generating 1,000 simulated null distributions of Tajima’s D for each estuary and comparing the mean and median values of the empirical distribution with the mean and median of each of the 1,000 simulated distributions. Significance was assessed by estimating the probability that the observed values fall into the distribution of simulated values (means and medians).

Significance (departure from neutrality, assuming genome is in equilibrium) tested per locus in each location, assuming a beta distribution (Tajima 1989Tajima, F. 1989. “Statistical method for testing the neutral mutation hypothesis by DNA polymorphism.” Genetics 123 (3): 585–595. https://doi.org/10.1534/genetics.107.079319.), as implemented in pegas. The number of significant (P < 0.05) loci, positive and negative, were then summarized by estuary. The distribution of loci with significant Tajima’s D values, both positive and negative, across linkage groups was assessed using those loci that previously were incorporated into a linkage map (O’Leary, et al. 2018).

\(\theta\) is the population-scaled mutation rate and can be estimated as \(\hat{\theta}_T\) (Nei 1987Nei, Masatoshi. 1987. Molecular Evolutionary Genetics. https://doi.org/10.7312/nei-92038.) as the number of pairwise differences (Tajima’s estimator or \(\pi\)), while \(\hat{\theta}_W\) is the number of segregating sites (Watterson’s estimator or \(S\), (Watterson 1975Watterson, G. A. 1975. “On the number of segregating sites in genetical models without recombination.” Theoretical Population Biology 7 (2): 256–76. https://doi.org/10.1016/0040-5809(75)90020-9.) . Because \(\hat{\theta}_T\) will underestimate the number of mutations that are rare in the population, Tajima’s \(D\) can be used to test the neutral mutation hypothesis. A negative Tajima’s \(D\) is indicative of positive selection while for balancing selection, alleles are kept at intermediate frequencies resulting in Tajima’s \(D\) becoming positive. To better understand what was driving patterns in neutrality tests , Θ was calculated based on the number of segregating sites as \(\hat{\theta}_W\) and based on pairwise differences among haplotypes (nucleotide diversity), and the mean and standard deviation compared across estuaries.

6.1 Genome-wide distriubtion

Identify null distribution of Tajima’s \(D\) across a genome for set of neutral loci (same number of loci with same distribution of segregating sites as empirical data set) using loci simuated under coalescence using MS (Hudson 2002Hudson, Richard R. 2002. “Generating samples under a Wright-Fisher neutral model of genetic variation.” Bioinformatics 18 (2): 337–38. https://doi.org/10.1093/bioinformatics/18.2.337.).

SanAntonioBay

Simulate 1000 independent replicates (loci) for 1 - 36 segregating sites under neutral model for 23 individuals (46 observations) for no population expansion and for population expansion for alpha = 10, 30, 90.

Population size is given by: N(t) = N0 exp(alpha*t), for t = time before the present, measured in units of 4N0 generations.


# simulated neutral loci
for i in {1..36}
do

  scr/MS/ms 46 1000 -s $i | scr/MS/sample_stats > results/Ind23_Sites${i}_a0_Rep1000.stats

done

# simulated neutral loci with population expansion for alpha
for a in 5 10 30 90
do

  for i in {1..36}
  do

    scr/MS/ms 46 1000 -s $i -G $a | scr/MS/sample_stats > results/Ind23_Sites${i}_a${a}_Rep1000.stats

  done

done

Create simulated data set based on empirical data set, i.e. determine number of loci with 1 … x segregating sites, then randomly draw that number of loci from simulated data set for each class of segregating loci for neutral loci w/no population growth.

# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind23")

# create an empty list that will serve as a container to receive the incoming files
sim <-list()

# create a loop to read in your data
for (i in 1:length(filenames)){

  sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
                         col_names = c("t1", "pi",
                                       "t2", "seg_sites",
                                       "t3", "tajima_d",
                                       "t4", "thetaH",
                                       "t5", "H")) %>%
    select(pi, seg_sites, tajima_d, thetaH, H)

}

# add names
names(sim) <- filenames

# combine into dataframe
sim <- ldply(sim, data.frame) %>%
  rename(simulation = `.id`) %>%
  separate(simulation, into = c("temp", "t1"), sep = -6) %>%
  separate(temp, into = c("N_IND", "t2", "ALPHA", "N_REPS"), sep = "_") %>%
  mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
         N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+")),
         ALPHA = as.numeric(str_extract(ALPHA, "[[:digit:]]+"))) %>%
  select(-t1, -t2)

# set number of cores to run in parallel
registerDoMC(50)

reps <- 1000

# group to be compared
grp <- "SanAntonioBay"

# population growth (alpha)
alpha <- c(0, 5, 10, 30, 90)

# get counts for number of segregating sites
emp <- read_delim("results/hap_div_est", delim = "\t") %>%
  select(locus, pop, seg_sites) %>%
  filter(!seg_sites == 0 & seg_sites < 35) %>%
  filter(pop == grp) %>%
  count(seg_sites)

null_distributions <- list()

for(a in alpha) {

  # replicate draws to match distribution in parallel
  results <- foreach(i = 1:reps) %dopar% {
  sim_data <- list()

    for (s in emp$seg_sites) {

      # determine number of loci with s segregating  sites
      t <- emp %>%
        filter(seg_sites == s)

      n_loci <- t$n

     # sample n_loci simulated loci (with replacement)
     sim_data[[s]] <- sim %>%
        filter(ALPHA == a) %>%
        filter(seg_sites == s) %>%
        sample_n(size = n_loci, replace = TRUE) %>%
        mutate(data_set = paste("sim_", as.character(i), sep = ""))

    }

  sim_data <- ldply(sim_data, data.frame) %>%
  select(seg_sites, ALPHA, tajima_d, data_set)

  }

  null_distributions[[as.character(a)]] <- ldply(results, data.frame)

}

null_distributions <- ldply(null_distributions, data.frame) %>%
  select(-`.id`)

write_delim(null_distributions, "results/SA.nulldist", delim = "\t")

Compare empirical and simulated data sets:

Cumulative distribution of Tajima’s D for empirical and simulated data sets for a range of population expansion values.

Cumulative distribution of Tajima's D for empirical and simulated data sets for a range of population expansion values.

Estimate p-values to determine if parameters describing the distribution of Tajima’s D across the genome is significantly different (smaller/larger) than simulated null distributions.

param <- taj %>%
  group_by(ALPHA, data_set) %>%
  summarise(mean = mean(tajima_d),
            std = sd(tajima_d),
            SE = std.error(tajima_d),
            quantile_10 = quantile(tajima_d, 0.1, na.rm=TRUE),
            quantile_25 = quantile(tajima_d, 0.25, na.rm=TRUE),
            median = median(tajima_d),
            quantile_75 = quantile(tajima_d, 0.75, na.rm=TRUE),
            quantile_90 = quantile(tajima_d, 0.90, na.rm=TRUE)) %>%
  gather(key = PARAM, value = VALUE, 3:10) %>%
  ungroup()


# number of permutations
nperm <- 1000

# parameters to assess p-value for
stats <- unique(param$PARAM)

# values for alpha (population growth parameter)
alpha <- c(0, 5, 10, 30, 90)

# dataframe for results
sign <- setNames(data.frame(matrix(ncol = 6, nrow = 0)), 
                    c("ALPHA", "PARAM", "OBS", "SIM", "PVALsmaller", "PVALlarger")) %>%
  mutate(ALPHA = as.character(ALPHA),
         PARAM = as.character(PARAM),
         OBS = as.numeric(OBS),
         SIM = as.numeric(SIM),
         PVALsmaller = as.numeric(PVALsmaller),
         PVALlarger = as.numeric(PVALlarger))

for(p in stats) {

  for(a in alpha) {

    # observed parameters
    obs <- param %>%
      filter(data_set == "empirical",
             PARAM == p)

    # simulated parameters
    sim <- param %>%
      filter(ALPHA == a,
             PARAM == p)

    # count number of times simulated value < observed value
    obs_value <- obs$VALUE

    smaller <- sim %>%
      filter(VALUE < obs_value) %>%
      nrow()

    # count number of times simulated value > observed value
    obs_value <- obs$VALUE

    larger <- sim %>%
      filter(VALUE > obs_value) %>%
      nrow()

    temp <- obs %>%
      select(PARAM, VALUE) %>%
      rename(OBS = VALUE) %>%
      mutate(SIM = mean(sim$VALUE),
             PVALsmaller = smaller/nperm,
             PVALlarger = larger/nperm,
             ALPHA = as.character(a))

    sign <- bind_rows(sign, temp)

  }

}

kable(
  sign %>%
  arrange(ALPHA),
  caption = "Summary statistics of the distribution of Tajima's D."
)

Summary statistics of the distribution of Tajima’s D.

ALPHA PARAM OBS SIM PVALsmaller PVALlarger
0 mean -0.6312303 -0.0567930 0.000 1.000
0 std 0.8072930 0.9690807 0.000 1.000
0 SE 0.0130771 0.0156979 0.000 1.000
0 quantile_10 -1.5194315 -1.1977150 0.000 1.000
0 quantile_25 -1.1840145 -0.8467111 0.000 1.000
0 median -0.8235974 -0.1778511 0.000 1.000
0 quantile_75 -0.1778705 0.5947352 0.000 1.000
0 quantile_90 0.4846081 1.3592651 0.000 1.000
10 mean -0.6312303 -0.6871890 1.000 0.000
10 std 0.8072930 0.7655427 1.000 0.000
10 SE 0.0130771 0.0124008 1.000 0.000
10 quantile_10 -1.5194315 -1.5232487 0.523 0.477
10 quantile_25 -1.1840145 -1.2173933 0.989 0.011
10 median -0.8235974 -0.8523805 0.992 0.008
10 quantile_75 -0.1778705 -0.2593794 1.000 0.000
10 quantile_90 0.4846081 0.3760594 1.000 0.000
30 mean -0.6312303 -0.9780005 1.000 0.000
30 std 0.8072930 0.6730369 1.000 0.000
30 SE 0.0130771 0.0109023 1.000 0.000
30 quantile_10 -1.5194315 -1.7052337 1.000 0.000
30 quantile_25 -1.1840145 -1.4638000 1.000 0.000
30 median -0.8235974 -1.1110330 1.000 0.000
30 quantile_75 -0.1778705 -0.6352838 1.000 0.000
30 quantile_90 0.4846081 -0.0757024 1.000 0.000
5 mean -0.6312303 -0.5342840 0.000 1.000
5 std 0.8072930 0.8039724 0.623 0.377
5 SE 0.0130771 0.0130233 0.623 0.377
5 quantile_10 -1.5194315 -1.4612264 0.000 1.000
5 quantile_25 -1.1840145 -1.1114426 0.000 1.000
5 median -0.8235974 -0.6728194 0.000 1.000
5 quantile_75 -0.1778705 -0.0491834 0.000 1.000
5 quantile_90 0.4846081 0.5833725 0.001 0.999
90 mean -0.6312303 -1.1783319 1.000 0.000
90 std 0.8072930 0.6312627 1.000 0.000
90 SE 0.0130771 0.0102256 1.000 0.000
90 quantile_10 -1.5194315 -1.8533453 1.000 0.000
90 quantile_25 -1.1840145 -1.5840429 1.000 0.000
90 median -0.8235974 -1.2972156 1.000 0.000
90 quantile_75 -0.1778705 -0.8968364 1.000 0.000
90 quantile_90 0.4846081 -0.3890550 1.000 0.000

Sabine Lake

Simulate 1000 independent replicates (loci) for 1 - 36 segregating sites under neutral model for 24 individuals (48 observations) for no population expansion and for population expansion for alpha = 5, 10, 30, 90.


# simulated neutral loci
for i in {1..36}
do

  scr/MS/ms 48 1000 -s $i | scr/MS/sample_stats > results/Ind24_Sites${i}_a0_Rep1000.stats

done

# simulated neutral loci with population expansion for alpha
for a in 5 10 30 90
do

  for i in {1..36}
  do

    scr/MS/ms 48 1000 -s $i -G $a | scr/MS/sample_stats > results/Ind24_Sites${i}_a${a}_Rep1000.stats

  done

done

Import results to compare null distribution of Tajima’s D for each locus according to number of segregating sites.

Create simulated data set based on empirical data set, i.e. determine number of loci with 1 … x segregating sites, then randomly draw that number of loci from simulated data set for each class of segregating loci for neutral loci w/no population growth.

Cumulative distribution of Tajima’s D for empirical and simulated data sets for a range of population expansion values.

Cumulative distribution of Tajima's D for empirical and simulated data sets for a range of population expansion values.

Compare distributions of empirical and simulated data sets to determine if parameters describing the empirical distributions is significantly larger or smaller than the simulated null distribution.

Summary statistics of the distribution of Tajima’s D.

ALPHA PARAM OBS SIM PVALsmaller PVALlarger
0 mean -0.6889841 -0.0785965 0.000 1.000
0 std 0.7920153 0.9618471 0.000 1.000
0 SE 0.0127331 0.0154635 0.000 1.000
0 quantile_10 -1.5665469 -1.2253161 0.000 1.000
0 quantile_25 -1.2405739 -0.8631777 0.000 1.000
0 median -0.8621855 -0.1849718 0.000 1.000
0 quantile_75 -0.2525410 0.5691393 0.000 1.000
0 quantile_90 0.3927034 1.2872139 0.000 1.000
10 mean -0.6889841 -0.7172501 0.991 0.009
10 std 0.7920153 0.7552064 1.000 0.000
10 SE 0.0127331 0.0121413 1.000 0.000
10 quantile_10 -1.5665469 -1.5605983 0.559 0.441
10 quantile_25 -1.2405739 -1.2537852 0.802 0.198
10 median -0.8621855 -0.8672021 0.991 0.009
10 quantile_75 -0.2525410 -0.2926859 0.972 0.028
10 quantile_90 0.3927034 0.3464750 0.983 0.017
30 mean -0.6889841 -0.9844898 1.000 0.000
30 std 0.7920153 0.6741268 1.000 0.000
30 SE 0.0127331 0.0108378 1.000 0.000
30 quantile_10 -1.5665469 -1.7161487 1.000 0.000
30 quantile_25 -1.2405739 -1.4668954 1.000 0.000
30 median -0.8621855 -1.1068650 1.000 0.000
30 quantile_75 -0.2525410 -0.6472442 1.000 0.000
30 quantile_90 0.3927034 -0.0792495 1.000 0.000
5 mean -0.6889841 -0.5435107 0.000 1.000
5 std 0.7920153 0.7996179 0.206 0.794
5 SE 0.0127331 0.0128553 0.206 0.794
5 quantile_10 -1.5665469 -1.4670794 0.000 1.000
5 quantile_25 -1.2405739 -1.1129919 0.000 1.000
5 median -0.8621855 -0.6719187 0.000 1.000
5 quantile_75 -0.2525410 -0.0418688 0.000 1.000
5 quantile_90 0.3927034 0.5586672 0.000 1.000
90 mean -0.6889841 -1.2001224 1.000 0.000
90 std 0.7920153 0.6137250 1.000 0.000
90 SE 0.0127331 0.0098668 1.000 0.000
90 quantile_10 -1.5665469 -1.8655387 1.000 0.000
90 quantile_25 -1.2405739 -1.6084470 1.000 0.000
90 median -0.8621855 -1.3063056 1.000 0.000
90 quantile_75 -0.2525410 -0.9370921 1.000 0.000
90 quantile_90 0.3927034 -0.4157354 1.000 0.000

Barataria Bay

Simulate 1000 independent replicates (loci) for 1 - 36 segregating sites under neutral model for 39 individuals (78 observations) for no population expansion and for population expansion for alpha = 5, 10, 30, 90.


# simulated neutral loci
for i in {1..36}
do

  scr/MS/ms 78 1000 -s $i | scr/MS/sample_stats > results/Ind39_Sites${i}_a0_Rep1000.stats

done

# simulated neutral loci with population expansion for alpha
for a in 5 10 30 90
do

  for i in {1..36}
  do

    scr/MS/ms 78 1000 -s $i -G $a | scr/MS/sample_stats > results/Ind39_Sites${i}_a${a}_Rep1000.stats

  done

done

Import results to compare null distribution of Tajima’s D for each locus according to number of segregating sites.

Create simulated data set based on empirical data set, i.e. determine number of loci with 1 … x segregating sites, then randomly draw that number of loci from simulated data set for each class of segregating loci for neutral loci w/no population growth.

Cumulative distribution of Tajima’s D for empirical and simulated data sets for a range of population expansion values.

Cumulative distribution of Tajima's D for empirical and simulated data sets for a range of population expansion values.

Compare distributions of empirical and simulated data sets to determine if parameters describing the empirical distributions is significantly larger or smaller than the simulated null distribution.

Summary statistics of the distribution of Tajima’s D.

ALPHA PARAM OBS SIM PVALsmaller PVALlarger
0 mean -0.6707489 -0.0636089 0.000 1.000
0 std 0.7885400 0.9522225 0.000 1.000
0 SE 0.0125577 0.0151644 0.000 1.000
0 quantile_10 -1.5262175 -1.1930471 0.000 1.000
0 quantile_25 -1.2360644 -0.8211557 0.000 1.000
0 median -0.8441499 -0.1800690 0.000 1.000
0 quantile_75 -0.2325124 0.5698343 0.000 1.000
0 quantile_90 0.3861367 1.3179328 0.000 1.000
10 mean -0.6707489 -0.7776915 1.000 0.000
10 std 0.7885400 0.7196294 1.000 0.000
10 SE 0.0125577 0.0114603 1.000 0.000
10 quantile_10 -1.5262175 -1.5674669 0.999 0.001
10 quantile_25 -1.2360644 -1.2925806 0.999 0.001
10 median -0.8441499 -0.9201910 1.000 0.000
10 quantile_75 -0.2325124 -0.3926858 1.000 0.000
10 quantile_90 0.3861367 0.2190113 1.000 0.000
30 mean -0.6707489 -1.0267308 1.000 0.000
30 std 0.7885400 0.6473541 1.000 0.000
30 SE 0.0125577 0.0103093 1.000 0.000
30 quantile_10 -1.5262175 -1.7253802 1.000 0.000
30 quantile_25 -1.2360644 -1.4641894 1.000 0.000
30 median -0.8441499 -1.1077578 1.000 0.000
30 quantile_75 -0.2325124 -0.7477087 1.000 0.000
30 quantile_90 0.3861367 -0.1837284 1.000 0.000
5 mean -0.6707489 -0.5837980 0.000 1.000
5 std 0.7885400 0.7795619 0.805 0.195
5 SE 0.0125577 0.0124147 0.805 0.195
5 quantile_10 -1.5262175 -1.4374738 0.000 1.000
5 quantile_25 -1.2360644 -1.1272069 0.000 1.000
5 median -0.8441499 -0.7477897 0.000 1.000
5 quantile_75 -0.2325124 -0.1516918 0.000 1.000
5 quantile_90 0.3861367 0.5134716 0.000 1.000
90 mean -0.6707489 -1.2524127 1.000 0.000
90 std 0.7885400 0.5745890 1.000 0.000
90 SE 0.0125577 0.0091505 1.000 0.000
90 quantile_10 -1.5262175 -1.8825303 1.000 0.000
90 quantile_25 -1.2360644 -1.6479069 1.000 0.000
90 median -0.8441499 -1.3320628 1.000 0.000
90 quantile_75 -0.2325124 -1.0206349 1.000 0.000
90 quantile_90 0.3861367 -0.5704469 1.000 0.000

West Mississippi Sound

Simulate 1000 independent replicates (loci) for 1 - 36 segregating sites under neutral model for 34 individuals (68 observations) for no population expansion and for population expansion for alpha = 5, 10, 30, 90.


# simulated neutral loci
for i in {1..36}
do

  scr/MS/ms 68 1000 -s $i | scr/MS/sample_stats > results/Ind34_Sites${i}_a0_Rep1000.stats

done

# simulated neutral loci with population expansion for alpha
for a in 5 10 30 90
do

  for i in {1..36}
  do

    scr/MS/ms 68 1000 -s $i -G $a | scr/MS/sample_stats > results/Ind34_Sites${i}_a${a}_Rep1000.stats

  done

done

Import results to compare null distribution of Tajima’s D for each locus according to number of segregating sites.

Create simulated data set based on empirical data set, i.e. determine number of loci with 1 … x segregating sites, then randomly draw that number of loci from simulated data set for each class of segregating loci for neutral loci w/no population growth.

Cumulative distribution of Tajima’s D for empirical and simulated data sets for a range of population expansion values.

Cumulative distribution of Tajima's D for empirical and simulated data sets for a range of population expansion values.

Compare distributions of empirical and simulated data sets to determine if parameters describing the empirical distributions is significantly larger or smaller than the simulated null distribution.

Summary statistics of the distribution of Tajima’s D.

ALPHA PARAM OBS SIM PVALsmaller PVALlarger
0 mean -0.6647244 -0.0678521 0.000 1.000
0 std 0.7930950 0.9555513 0.000 1.000
0 SE 0.0127176 0.0153227 0.000 1.000
0 quantile_10 -1.5285687 -1.1962915 0.000 1.000
0 quantile_25 -1.2234043 -0.8215735 0.000 1.000
0 median -0.8347845 -0.1906789 0.000 1.000
0 quantile_75 -0.2189918 0.5547455 0.000 1.000
0 quantile_90 0.4300502 1.3257402 0.000 1.000
10 mean -0.6647244 -0.7509188 1.000 0.000
10 std 0.7930950 0.7298519 1.000 0.000
10 SE 0.0127176 0.0117035 1.000 0.000
10 quantile_10 -1.5285687 -1.5552237 0.994 0.006
10 quantile_25 -1.2234043 -1.2737350 0.998 0.002
10 median -0.8347845 -0.9001907 1.000 0.000
10 quantile_75 -0.2189918 -0.3570284 1.000 0.000
10 quantile_90 0.4300502 0.2557269 1.000 0.000
30 mean -0.6647244 -1.0328754 1.000 0.000
30 std 0.7930950 0.6477537 1.000 0.000
30 SE 0.0127176 0.0103870 1.000 0.000
30 quantile_10 -1.5285687 -1.7289601 1.000 0.000
30 quantile_25 -1.2234043 -1.4757960 1.000 0.000
30 median -0.8347845 -1.1076278 1.000 0.000
30 quantile_75 -0.2189918 -0.7367943 1.000 0.000
30 quantile_90 0.4300502 -0.1739533 1.000 0.000
5 mean -0.6647244 -0.5836971 0.000 1.000
5 std 0.7930950 0.7829934 0.853 0.147
5 SE 0.0127176 0.0125556 0.853 0.147
5 quantile_10 -1.5285687 -1.4581533 0.000 1.000
5 quantile_25 -1.2234043 -1.1342969 0.000 1.000
5 median -0.8347845 -0.7362509 0.000 1.000
5 quantile_75 -0.2189918 -0.1284639 0.000 1.000
5 quantile_90 0.4300502 0.5129576 0.000 1.000
90 mean -0.6647244 -1.2437423 1.000 0.000
90 std 0.7930950 0.5780765 1.000 0.000
90 SE 0.0127176 0.0092697 1.000 0.000
90 quantile_10 -1.5285687 -1.8759689 1.000 0.000
90 quantile_25 -1.2234043 -1.6546031 1.000 0.000
90 median -0.8347845 -1.3208527 1.000 0.000
90 quantile_75 -0.2189918 -1.0033864 1.000 0.000
90 quantile_90 0.4300502 -0.5375963 1.000 0.000

Mobile Bay

Simulate 1000 independent replicates (loci) for 1 - 36 segregating sites under neutral model for 62 individuals (48 observations) for no population expansion and for population expansion for alpha = 5, 10, 30, 90.


# simulated neutral loci
for i in {1..36}
do

  scr/MS/ms 124 1000 -s $i | scr/MS/sample_stats > results/Ind62_Sites${i}_a0_Rep1000.stats

done

# simulated neutral loci with population expansion for alpha
for a in 5 10 30 90
do

  for i in {1..36}
  do

    scr/MS/ms 124 1000 -s $i -G $a | scr/MS/sample_stats > results/Ind62_Sites${i}_a${a}_Rep1000.stats

  done

done

Import results to compare null distribution of Tajima’s D for each locus according to number of segregating sites.

Create simulated data set based on empirical data set, i.e. determine number of loci with 1 … x segregating sites, then randomly draw that number of loci from simulated data set for each class of segregating loci for neutral loci w/no population growth.

Cumulative distribution of Tajima’s D for empirical and simulated data sets for a range of population expansion values.

Cumulative distribution of Tajima's D for empirical and simulated data sets for a range of population expansion values.

Compare distributions of empirical and simulated data sets to determine if parameters describing the empirical distributions is significantly larger or smaller than the simulated null distribution.

Summary statistics of the distribution of Tajima’s D.

ALPHA PARAM OBS SIM PVALsmaller PVALlarger
0 mean -0.6717212 -0.0689478 0.000 1.000
0 std 0.7758586 0.9639020 0.000 1.000
0 SE 0.0122720 0.0152463 0.000 1.000
0 quantile_10 -1.5051528 -1.2088858 0.000 1.000
0 quantile_25 -1.2227716 -0.8319931 0.000 1.000
0 median -0.8432089 -0.1899281 0.000 1.000
0 quantile_75 -0.2830638 0.5427770 0.000 1.000
0 quantile_90 0.3850940 1.3022929 0.000 1.000
10 mean -0.6717212 -0.8015301 1.000 0.000
10 std 0.7758586 0.6954500 1.000 0.000
10 SE 0.0122720 0.0110002 1.000 0.000
10 quantile_10 -1.5051528 -1.5745937 1.000 0.000
10 quantile_25 -1.2227716 -1.2916222 1.000 0.000
10 median -0.8432089 -0.9235151 1.000 0.000
10 quantile_75 -0.2830638 -0.4427216 1.000 0.000
10 quantile_90 0.3850940 0.1487216 1.000 0.000
30 mean -0.6717212 -1.0753747 1.000 0.000
30 std 0.7758586 0.6188381 1.000 0.000
30 SE 0.0122720 0.0097884 1.000 0.000
30 quantile_10 -1.5051528 -1.7487884 1.000 0.000
30 quantile_25 -1.2227716 -1.4994872 1.000 0.000
30 median -0.8432089 -1.1673112 1.000 0.000
30 quantile_75 -0.2830638 -0.8075808 1.000 0.000
30 quantile_90 0.3850940 -0.2927997 1.000 0.000
5 mean -0.6717212 -0.6224272 0.000 1.000
5 std 0.7758586 0.7627542 0.896 0.104
5 SE 0.0122720 0.0120647 0.896 0.104
5 quantile_10 -1.5051528 -1.4563755 0.000 1.000
5 quantile_25 -1.2227716 -1.1662361 0.000 1.000
5 median -0.8432089 -0.7811226 0.000 1.000
5 quantile_75 -0.2830638 -0.2070992 0.000 1.000
5 quantile_90 0.3850940 0.4422204 0.042 0.958
90 mean -0.6717212 -1.2901302 1.000 0.000
90 std 0.7758586 0.5608594 1.000 0.000
90 SE 0.0122720 0.0088713 1.000 0.000
90 quantile_10 -1.5051528 -1.8965090 1.000 0.000
90 quantile_25 -1.2227716 -1.6774317 1.000 0.000
90 median -0.8432089 -1.3533690 1.000 0.000
90 quantile_75 -0.2830638 -1.0039278 1.000 0.000
90 quantile_90 0.3850940 -0.6480351 1.000 0.000

Apalachicola Bay

Simulate 1000 independent replicates (loci) for 1 - 36 segregating sites under neutral model for 44 individuals (48 observations) for no population expansion and for population expansion for alpha = 5, 10, 30, 90.


# simulated neutral loci
for i in {1..36}
do

  scr/MS/ms 88 1000 -s $i | scr/MS/sample_stats > results/Ind44_Sites${i}_a0_Rep1000.stats

done

# simulated neutral loci with population expansion for alpha
for a in 5 10 30 90
do

  for i in {1..36}
  do

    scr/MS/ms 88 1000 -s $i -G $a | scr/MS/sample_stats > results/Ind44_Sites${i}_a${a}_Rep1000.stats

  done

done

Import results to compare null distribution of Tajima’s D for each locus according to number of segregating sites.

Create simulated data set based on empirical data set, i.e. determine number of loci with 1 … x segregating sites, then randomly draw that number of loci from simulated data set for each class of segregating loci for neutral loci w/no population growth.

Cumulative distribution of Tajima’s D for empirical and simulated data sets for a range of population expansion values.

Cumulative distribution of Tajima's D for empirical and simulated data sets for a range of population expansion values.

Compare distributions of empirical and simulated data sets to determine if parameters describing the empirical distributions is significantly larger or smaller than the simulated null distribution.

Summary statistics of the distribution of Tajima’s D.

ALPHA PARAM OBS SIM PVALsmaller PVALlarger
0 mean -0.6661067 -0.0896773 0.000 1.000
0 std 0.7805748 0.9503664 0.000 1.000
0 SE 0.0123995 0.0150966 0.000 1.000
0 quantile_10 -1.4907011 -1.2059490 0.000 1.000
0 quantile_25 -1.2205584 -0.8521527 0.000 1.000
0 median -0.8291125 -0.1987122 0.000 1.000
0 quantile_75 -0.2722850 0.5167723 0.000 1.000
0 quantile_90 0.4035077 1.2789622 0.000 1.000
10 mean -0.6661067 -0.7805077 1.000 0.000
10 std 0.7805748 0.7153168 1.000 0.000
10 SE 0.0123995 0.0113628 1.000 0.000
10 quantile_10 -1.4907011 -1.5656684 1.000 0.000
10 quantile_25 -1.2205584 -1.2950439 1.000 0.000
10 median -0.8291125 -0.9165530 1.000 0.000
10 quantile_75 -0.2722850 -0.4092971 1.000 0.000
10 quantile_90 0.4035077 0.2072104 1.000 0.000
30 mean -0.6661067 -1.0386845 1.000 0.000
30 std 0.7805748 0.6376537 1.000 0.000
30 SE 0.0123995 0.0101291 1.000 0.000
30 quantile_10 -1.4907011 -1.7392293 1.000 0.000
30 quantile_25 -1.2205584 -1.4780679 1.000 0.000
30 median -0.8291125 -1.1145536 1.000 0.000
30 quantile_75 -0.2722850 -0.7612880 1.000 0.000
30 quantile_90 0.4035077 -0.2119599 1.000 0.000
5 mean -0.6661067 -0.5990917 0.000 1.000
5 std 0.7805748 0.7831740 0.404 0.596
5 SE 0.0123995 0.0124407 0.404 0.596
5 quantile_10 -1.4907011 -1.4651840 0.012 0.988
5 quantile_25 -1.2205584 -1.1485910 0.000 1.000
5 median -0.8291125 -0.7650752 0.000 1.000
5 quantile_75 -0.2722850 -0.1606926 0.000 1.000
5 quantile_90 0.4035077 0.4906106 0.002 0.998
90 mean -0.6661067 -1.2487284 1.000 0.000
90 std 0.7805748 0.5775171 1.000 0.000
90 SE 0.0123995 0.0091739 1.000 0.000
90 quantile_10 -1.4907011 -1.8834868 1.000 0.000
90 quantile_25 -1.2205584 -1.6445014 1.000 0.000
90 median -0.8291125 -1.3337679 1.000 0.000
90 quantile_75 -0.2722850 -0.9981104 1.000 0.000
90 quantile_90 0.4035077 -0.5426452 1.000 0.000

St. Johns River

Simulate 1000 independent replicates (loci) for 1 - 36 segregating sites under neutral model for 20 individuals (40 observations) for no population expansion and for population expansion for alpha = 5, 10, 30, 90.


# simulated neutral loci
for i in {1..36}
do

  scr/MS/ms 40 1000 -s $i | scr/MS/sample_stats > results/Ind20_Sites${i}_a0_Rep1000.stats

done

# simulated neutral loci with population expansion for alpha
for a in 5 10 30 90
do

  for i in {1..36}
  do

    scr/MS/ms 40 1000 -s $i -G $a | scr/MS/sample_stats > results/Ind20_Sites${i}_a${a}_Rep1000.stats

  done

done

Import results to compare null distribution of Tajima’s D for each locus according to number of segregating sites.

Create simulated data set based on empirical data set, i.e. determine number of loci with 1 … x segregating sites, then randomly draw that number of loci from simulated data set for each class of segregating loci for neutral loci w/no population growth.

Cumulative distribution of Tajima’s D for empirical and simulated data sets for a range of population expansion values.

Cumulative distribution of Tajima's D for empirical and simulated data sets for a range of population expansion values.

Compare distributions of empirical and simulated data sets to determine if parameters describing the empirical distributions is significantly larger or smaller than the simulated null distribution.

Summary statistics of the distribution of Tajima’s D.

ALPHA PARAM OBS SIM PVALsmaller PVALlarger
0 mean -0.3462437 -0.0439215 0.000 1.000
0 std 0.8460972 0.9696028 0.000 1.000
0 SE 0.0139722 0.0160117 0.000 1.000
0 quantile_10 -1.3165895 -1.1748057 0.000 1.000
0 quantile_25 -1.0313223 -0.8346280 0.000 1.000
0 median -0.4833589 -0.1388095 0.000 1.000
0 quantile_75 0.1752560 0.6228500 0.000 1.000
0 quantile_90 0.8826601 1.3525285 0.000 1.000
10 mean -0.3462437 -0.6666791 1.000 0.000
10 std 0.8460972 0.7852610 1.000 0.000
10 SE 0.0139722 0.0129676 1.000 0.000
10 quantile_10 -1.3165895 -1.4969545 1.000 0.000
10 quantile_25 -1.0313223 -1.1910799 1.000 0.000
10 median -0.4833589 -0.8360661 1.000 0.000
10 quantile_75 0.1752560 -0.2587536 1.000 0.000
10 quantile_90 0.8826601 0.4446643 1.000 0.000
30 mean -0.3462437 -0.9078948 1.000 0.000
30 std 0.8460972 0.7084764 1.000 0.000
30 SE 0.0139722 0.0116996 1.000 0.000
30 quantile_10 -1.3165895 -1.6739284 1.000 0.000
30 quantile_25 -1.0313223 -1.4184819 1.000 0.000
30 median -0.4833589 -1.1158471 1.000 0.000
30 quantile_75 0.1752560 -0.5674188 1.000 0.000
30 quantile_90 0.8826601 0.0790274 1.000 0.000
5 mean -0.3462437 -0.5227762 1.000 0.000
5 std 0.8460972 0.8177376 0.998 0.002
5 SE 0.0139722 0.0135039 0.998 0.002
5 quantile_10 -1.3165895 -1.4726406 1.000 0.000
5 quantile_25 -1.0313223 -1.1241070 1.000 0.000
5 median -0.4833589 -0.6596520 1.000 0.000
5 quantile_75 0.1752560 -0.0266447 1.000 0.000
5 quantile_90 0.8826601 0.5947087 1.000 0.000
90 mean -0.3462437 -1.1136465 1.000 0.000
90 std 0.8460972 0.6357307 1.000 0.000
90 SE 0.0139722 0.0104983 1.000 0.000
90 quantile_10 -1.3165895 -1.7732140 1.000 0.000
90 quantile_25 -1.0313223 -1.5276425 1.000 0.000
90 median -0.4833589 -1.1397679 1.000 0.000
90 quantile_75 0.1752560 -0.8367252 1.000 0.000
90 quantile_90 0.8826601 -0.2998549 1.000 0.000

Charleston Harbor

Simulate 1000 independent replicates (loci) for 1 - 36 segregating sites under neutral model for 18 individuals (36 observations) for no population expansion and for population expansion for alpha = 5, 10, 30, 90.


# simulated neutral loci
for i in {1..36}
do

  scr/MS/ms 36 1000 -s $i | scr/MS/sample_stats > results/Ind18_Sites${i}_a0_Rep1000.stats

done

# simulated neutral loci with population expansion for alpha
for a in 5 10 30 90
do

  for i in {1..36}
  do

    scr/MS/ms 36 1000 -s $i -G $a | scr/MS/sample_stats > results/Ind18_Sites${i}_a${a}_Rep1000.stats

  done

done

Import results to compare null distribution of Tajima’s D for each locus according to number of segregating sites.

Create simulated data set based on empirical data set, i.e. determine number of loci with 1 … x segregating sites, then randomly draw that number of loci from simulated data set for each class of segregating loci for neutral loci w/no population growth.

Cumulative distribution of Tajima’s D for empirical and simulated data sets for a range of population expansion values.

Cumulative distribution of Tajima's D for empirical and simulated data sets for a range of population expansion values.

Compare distributions of empirical and simulated data sets to determine if parameters describing the empirical distributions is significantly larger or smaller than the simulated null distribution.

Summary statistics of the distribution of Tajima’s D.

ALPHA PARAM OBS SIM PVALsmaller PVALlarger
0 mean -0.3696008 -0.0600008 0.000 1.000
0 std 0.8538342 0.9702727 0.000 1.000
0 SE 0.0140750 0.0159945 0.000 1.000
0 quantile_10 -1.3529714 -1.2169483 0.000 1.000
0 quantile_25 -1.0847554 -0.8240396 0.000 1.000
0 median -0.5129384 -0.1662956 0.000 1.000
0 quantile_75 0.2122607 0.6158444 0.000 1.000
0 quantile_90 0.8667219 1.3503437 0.000 1.000
10 mean -0.3696008 -0.6523428 1.000 0.000
10 std 0.8538342 0.7878707 1.000 0.000
10 SE 0.0140750 0.0129877 1.000 0.000
10 quantile_10 -1.3529714 -1.4958990 1.000 0.000
10 quantile_25 -1.0847554 -1.1833798 1.000 0.000
10 median -0.5129384 -0.8141247 1.000 0.000
10 quantile_75 0.2122607 -0.2091295 1.000 0.000
10 quantile_90 0.8667219 0.4451190 1.000 0.000
30 mean -0.3696008 -0.8993336 1.000 0.000
30 std 0.8538342 0.7020571 1.000 0.000
30 SE 0.0140750 0.0115731 1.000 0.000
30 quantile_10 -1.3529714 -1.6506044 1.000 0.000
30 quantile_25 -1.0847554 -1.3928468 1.000 0.000
30 median -0.5129384 -1.0944458 1.000 0.000
30 quantile_75 0.2122607 -0.5261850 1.000 0.000
30 quantile_90 0.8667219 0.0357848 1.000 0.000
5 mean -0.3696008 -0.5071613 1.000 0.000
5 std 0.8538342 0.8304972 0.989 0.011
5 SE 0.0140750 0.0136903 0.989 0.011
5 quantile_10 -1.3529714 -1.4796425 1.000 0.000
5 quantile_25 -1.0847554 -1.1332130 1.000 0.000
5 median -0.5129384 -0.6749958 1.000 0.000
5 quantile_75 0.2122607 0.0270473 1.000 0.000
5 quantile_90 0.8667219 0.6706195 1.000 0.000
90 mean -0.3696008 -1.0936692 1.000 0.000
90 std 0.8538342 0.6518576 1.000 0.000
90 SE 0.0140750 0.0107456 1.000 0.000
90 quantile_10 -1.3529714 -1.7581458 1.000 0.000
90 quantile_25 -1.0847554 -1.5065644 1.000 0.000
90 median -0.5129384 -1.1380110 1.000 0.000
90 quantile_75 0.2122607 -0.8136294 1.000 0.000
90 quantile_90 0.8667219 -0.2310249 1.000 0.000

Winyah Bay

Simulate 1000 independent replicates (loci) for 1 - 36 segregating sites under neutral model for 18 individuals (36 observations) for no population expansion and for population expansion for alpha = 5, 10, 30, 90.


# simulated neutral loci
for i in {1..36}
do

  scr/MS/ms 36 1000 -s $i | scr/MS/sample_stats > results/Ind18_Sites${i}_a0_Rep1000.stats

done

# simulated neutral loci with population expansion for alpha
for a in 5 10 30 90
do

  for i in {1..36}
  do

    scr/MS/ms 36 1000 -s $i -G $a | scr/MS/sample_stats > results/Ind18_Sites${i}_a${a}_Rep1000.stats

  done

done

Import results to compare null distribution of Tajima’s D for each locus according to number of segregating sites.

Create simulated data set based on empirical data set, i.e. determine number of loci with 1 … x segregating sites, then randomly draw that number of loci from simulated data set for each class of segregating loci for neutral loci w/no population growth.

Cumulative distribution of Tajima’s D for empirical and simulated data sets for a range of population expansion values.

Cumulative distribution of Tajima's D for empirical and simulated data sets for a range of population expansion values.

Compare distributions of empirical and simulated data sets to determine if parameters describing the empirical distributions is significantly larger or smaller than the simulated null distribution.

Summary statistics of the distribution of Tajima’s D.

ALPHA PARAM OBS SIM PVALsmaller PVALlarger
0 mean -0.3282842 -0.0601800 0.000 1.000
0 std 0.8595211 0.9700697 0.000 1.000
0 SE 0.0143174 0.0161589 0.000 1.000
0 quantile_10 -1.3004334 -1.2193681 0.000 1.000
0 quantile_25 -1.0355594 -0.8234185 0.000 1.000
0 median -0.4826751 -0.1647173 0.000 1.000
0 quantile_75 0.2416686 0.6138551 0.000 1.000
0 quantile_90 0.8841898 1.3471081 0.000 1.000
10 mean -0.3282842 -0.6545048 1.000 0.000
10 std 0.8595211 0.7869050 1.000 0.000
10 SE 0.0143174 0.0131078 1.000 0.000
10 quantile_10 -1.3004334 -1.4960315 1.000 0.000
10 quantile_25 -1.0355594 -1.1904854 1.000 0.000
10 median -0.4826751 -0.8143589 1.000 0.000
10 quantile_75 0.2416686 -0.2111951 1.000 0.000
10 quantile_90 0.8841898 0.4382791 1.000 0.000
30 mean -0.3282842 -0.9001954 1.000 0.000
30 std 0.8595211 0.7017828 1.000 0.000
30 SE 0.0143174 0.0116899 1.000 0.000
30 quantile_10 -1.3004334 -1.6532718 1.000 0.000
30 quantile_25 -1.0355594 -1.3954230 1.000 0.000
30 median -0.4826751 -1.0942429 1.000 0.000
30 quantile_75 0.2416686 -0.5269851 1.000 0.000
30 quantile_90 0.8841898 0.0346313 1.000 0.000
5 mean -0.3282842 -0.5087689 1.000 0.000
5 std 0.8595211 0.8295508 0.998 0.002
5 SE 0.0143174 0.0138182 0.998 0.002
5 quantile_10 -1.3004334 -1.4810724 1.000 0.000
5 quantile_25 -1.0355594 -1.1332130 1.000 0.000
5 median -0.4826751 -0.6754370 1.000 0.000
5 quantile_75 0.2416686 0.0255215 1.000 0.000
5 quantile_90 0.8841898 0.6658911 1.000 0.000
90 mean -0.3282842 -1.0956612 1.000 0.000
90 std 0.8595211 0.6511864 1.000 0.000
90 SE 0.0143174 0.0108471 1.000 0.000
90 quantile_10 -1.3004334 -1.7599900 1.000 0.000
90 quantile_25 -1.0355594 -1.5099222 1.000 0.000
90 median -0.4826751 -1.1404943 1.000 0.000
90 quantile_75 0.2416686 -0.8139015 1.000 0.000
90 quantile_90 0.8841898 -0.2314143 1.000 0.000

6.2 Locus-by-locus comparison.

Calculates two p-values: (1) p-value assuming that \(D\) follows a normal distribution with mean zero and variance one (normal), (2) p-value assuming that \(D\) follows a beta distribution after rescaling on [0, 1] following Tajima 1989. Beta distribution is known to be overly conservative (i.e. fails to reject null hypothesis even when selection is occuring). The alternative is to simulate a null distribution for neutral loci using e.g. coalescent theory.

A total of 422 (12 after adjustment) loci were flagged as outlier in at least one estuary, 63 (5 after adjustment) loci were flagged as a positive outlier in at least one estuary, 361 (9 after adjustment) loci as negative outlier. A total of 167 (2 after adjustment) loci were flagged as both positive and negative outlier multiple estuaries.

Identify the number of loci that were flagged as negative and positive outlier in each estuary.

Number of loci flagged as significant positive or negative outlier per estuary.

Estuary NEGATIVE POSITIVE
SanAntonioBay 86 19
SabineLake 108 12
BaratariaBay 94 11
WestMississippiSound 93 14
MobileBay 89 13
ApalachicolaBay 80 19
StJohnsRiver 25 15
CharlestonHarbor 31 19
WinyahBay 20 21

Compare after adjusting per estuary.

Number of loci flagged as significant positive or negative outlier per estuary after adjusting by estuary.

Estuary NEGATIVE POSITIVE
SanAntonioBay 2 2
SabineLake 3 4
BaratariaBay 1 4
WestMississippiSound 4 4
MobileBay 4 4
ApalachicolaBay 1 4
StJohnsRiver 1 NA
CharlestonHarbor 1 1
WinyahBay NA 1

Identify loci that have been previously mapped on the linkage map to identify patterns of clustering if present.

Distribution of Tajima’s D per locus calculated for individuals within each estuary across the genome. Loci flagged as significant outlier compared to the expected null distribution under equilibrium are indicated in red.

A total of 155 loci flagged as significant outlier in at least one estuary were located on the map.

6.3 Comparison of long-term effective populations size using the population-scaled mutation rate \(\theta\)

\(\theta = 4N_{e}\mu\) is the population-scaled mutation rate, with \(\mu\) = mutation rate per bp per generation. Because of its relationship to the long-term effective population size it can be used as a proxy for \(N_{e}\) to understand the effects of drift. When \(N_{e}\) and \(\mu\) are unknown, \(\theta\) can be estimated on a locus-per-locus basis as \(\hat{\theta}_W\) using the number of segregating sites (observed nucleotide diversity \(S\)) or as \(\theta_{T}\) based on pairwise differences among haplotypes (nucleotide diversity \(\pi\)).

Watterson’s estimator (segregating sites)

Watterson’s estimator \(\theta_{W}\) (Watterson 1975Watterson, G. A. 1975. “On the number of segregating sites in genetical models without recombination.” Theoretical Population Biology 7 (2): 256–76. https://doi.org/10.1016/0040-5809(75)90020-9.) is an empirical way to measure genetic variation based on the number of segregating sites in a locus (observed nucleotide diversity). Despite the fact that estimates of \(\theta_{W}\) is biased for loci under selection and by population expansion it can be used as a proxy for relative differences in long-term effective population Ne size among populations even when a genome is not in equilibrium due to the theoretical relationship of \(\theta\) and Ne (\(\theta = 4N_{e}\mu\)).

Compare differences in distribution of \(\theta_{W}\) across all loci among estuaries.

Distribution of theta W per estuary.

Distribution of theta W per estuary.

Comparison of mean +/- standard deviation, minimum, and maximum values of theta W across loci among estuaries.

pop mean std max min
SanAntonioBay 0.4543470 0.3537153 2.696269 0
SabineLake 0.4814806 0.3642302 3.257992 0
BaratariaBay 0.5364753 0.3970729 3.595026 0
WestMississippiSound 0.5122540 0.3805040 3.257992 0
MobileBay 0.6024658 0.4263088 3.482681 0
ApalachicolaBay 0.5515221 0.4013758 2.920958 0
StJohnsRiver 0.3704835 0.3042725 2.022202 0
CharlestonHarbor 0.3596934 0.2960188 2.359236 0
WinyahBay 0.3531063 0.2952937 2.246891 0

Mean values of \(\theta_{W}\) are lower are lower in estuaries of the Atlantic compared to the Gulf of Mexico, implying that long-term effective population sizes are larger in Gulf estuaries compared to estuaries in the Atlantic.

Test for significant differences among ocean basins using a Mann-Whitney test (unpaired two-samples Wilcoxon test/Wilcoxon rank sum test).

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  mean by OCEAN
## W = 0, p-value = 0.02819
## alternative hypothesis: true location shift is not equal to 0

Nucleotide diversity \(\pi\) (pairwise differences)

\(\theta_{T}\) (Nei 1987Nei, Masatoshi. 1987. Molecular Evolutionary Genetics. https://doi.org/10.7312/nei-92038.) is calculated as the sum of the number of differences between pairs of haplotypes of a given locus divided by the number of comparisons made. This parameter is baised towards allele segregating at intermediate rates.

Compare differences in distribution of \(\theta_{T}\) across all loci among estuaries.

Distribution of theta T across all loci among estuaries.

Distribution of theta T across all loci among estuaries.

Comparison of mean +/- standard deviation, minimum, and maximum values of the nucleotide diversity (pi) across loci among estuaries.

pop mean std max min
SanAntonioBay 0.0021120 0.0019737 0.0147258 0
SabineLake 0.0021483 0.0019934 0.0151231 0
BaratariaBay 0.0021150 0.0019717 0.0170939 0
WestMississippiSound 0.0021052 0.0019699 0.0170587 0
MobileBay 0.0021120 0.0019506 0.0160608 0
ApalachicolaBay 0.0021141 0.0019686 0.0172357 0
StJohnsRiver 0.0020896 0.0019782 0.0151531 0
CharlestonHarbor 0.0020789 0.0019580 0.0155599 0
WinyahBay 0.0020789 0.0019930 0.0178757 0

Mean values of \(\theta_{T}\) are lower are marginally lower in estuaries of the Atlantic compared to the Gulf of Mexico, implying that long-term effective population sizes are larger in Gulf estuaries compared to estuaries in the Atlantic.

Test for significant differences among ocean basins using a Mann-Whitney test (unpaired two-samples Wilcoxon test/Wilcoxon rank sum test).

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  mean by OCEAN
## W = 0, p-value = 0.02819
## alternative hypothesis: true location shift is not equal to 0

7 Landscape Genetics

Redundancy analyses (RDA), as implemented in vegan (Oksanen et al. 2013Oksanen, J, F. G Blanchet, R Kindt, P Legendre, P. R Minchin, R. B O’Hara, G. L Simpson, et al. 2013. “Package ‘vegan’. Community ecology package.”), was used to assess the influence of geographic distance and of variable environmental parameters and their interaction on observed patterns of genomic variation among samples from the Gulf. RDA was not carried out among samples from the Atlantic because of limited geographic spread.

RDA is a constrained ordination method that extracts and summarizes components of variation in a multidimensional data set explained by a set of explanatory variables. Its purpose is to parse and visualize components of genomic variation (response variables) that are explained by geography and/or environment (explanatory variables) and to identify alleles/loci driving an observed pattern. While applying it in this way were multiple individuals from the same location are assigned the same values for spatial and environmental variables may result in some bias due to pseud-replication, not transferring the response variables to a sample level, for example using a correspondance analysis allows for the application of the RDA as an association approach when using genomic data that may not rely on equilibrium assumptions present in \(F_{ST}\)-based analyses (Forester et al. 2018Forester, Brenna R., Jesse R. Lasky, Helene H. Wagner, and Dean L. Urban. 2018. “Comparing methods for detecting multilocus adaptation with multivariate genotype–environment associations.” Molecular Ecology 27 (9): 2215–33. https://doi.org/10.1111/mec.14584.). The R2 value obtained is equivalent to the proportion of genomic variation explained by explanatory variables, allowing for a comparison of the relative importance of the variables and their interaction.

7.1 Format matrices and select best models

Response variables: Genotype data

Format response variables (genetic data) as allele counts per individual.

RDA requires a complete data set. Replace missing values with mean allele frequency, then extract allele counts per individual.

Constraining matrix: Coastal distance (spatial impacts)

Geographic distance was measured as approximate coastline distance. For coastal fish species coastal distance more appropriate parameterization of geography as relationship might not be linear.

Format parameters to create spatial matrix by calculating polynomials (10th degree) geographic distance to fit trend surface. To reduce effects of pseudo-replication, jitter distance.

Select best model for spatial variables (coastal distance) using forward model selection and permutation tests (999 permutations).

# run initial rda
rda.xy = rda(allelecount ~ ., data.frame(xy), scale = FALSE)

# forward selection of spatial variables
stp.xy = ordiR2step(rda(allelecount ~ 1, data.frame(xy)),
                   scope = formula(rda.xy),
                   R2scope = FALSE,
                   scale = FALSE,
                   Pin = 0.15,
                   direction="forward",
                   R2permutations = 999,
                   parallel = 45)
## Step: R2.adj= 0 
## Call: allelecount ~ 1 
##  
##                  R2.adjusted
## + DIST_X3      0.00014093653
## + DIST_jitter  0.00006475115
## + DIST         0.00006432336
## + DIST_X1      0.00006432336
## <none>         0.00000000000
## + DIST_X2     -0.00002513934
## 
##           Df    AIC      F Pr(>F)  
## + DIST_X3  1 1932.6 1.0331   0.02 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step: R2.adj= 0.0001409365 
## Call: allelecount ~ DIST_X3 
##  
##                R2.adjusted
## + DIST_jitter 0.0002063814
## + DIST        0.0002061408
## + DIST_X1     0.0002061408
## <none>        0.0001409365
## + DIST_X2     0.0001162942
## 
##               Df    AIC      F Pr(>F)
## + DIST_jitter  1 1933.6 1.0153  0.186
selected_xy <- attributes(stp.xy$terms)$term.labels

Selected parameters for the best model is DIST_X3.

Constraining matrix: Environmental differences among estuaries

Data for 39 environmental parameters for each of the sample locations in the Gulf were obtained from the National Estuary Eutrophication Assessment database (NEAA 2018NEAA. 2018. “National Estuary Eutrophication Assessment.” http://ian.umces.edu/neea/siteinformation.php.). Parameters were then PCA-transformed, resulting in new synthetic variables summarizing environmental differences among estuaries. Forward selection of variables was performed on the synthetic variables to identify the best model of environmental variables and based on the results, the PC were used as the constraining matrix for the final RDA; significance was determined using 1,000 permutations.

Import environmental data from National Estuary Eutrophication Assessment to use as environmental predictors.

Parameters in data set include Estuary Area, Tidal Fresh Zone Area, Mixing Zone Area, Saltwater Zone Area, Estuary Volume, Estuary Depth, Estuary Perimeter, Tide Height, Tide Volume, Tide Volume/Day, Tide Ratio, Stratification Ratio, Percent Mixed Water, Percent Seawater, Average Salinity, Tidal Exchange, Daily Freshwater/Estuary Area, Daily Freshwater, Flow/Estuary Area, Total Freshwater Volume, Daily Precipitation, Daily Evaporation, Daily Precipitation/Estuary Area, Daily Evaporation/Estuary Area, Flow, Wind Speed, Sea Surface Temperature (mean), Sea Surface Temperature (std), Ocean Salinity (mean), Ocean Salinity (max), Ocean Salinity (min), Ocean Dissolved Inorganic Phosphorus, Oceanic NO3, Total Suspended Sediment, Total Nitrogen, Total Phosphorus, Total Suspended Sediment/Estuary Area, Total Nitrogen/Estuary Area, Total Phosphorus/Estuary Area, Wetland (km2) total, and Wetland (km2) percent.

Select environmental information for estuaries that have been sampled and perform PCA on scaled parameters.

Scree plot indicating percent variance summarized by each principle component.

Scree plot indicating percent variance summarized by each principle component.

PCA transform all significant environmental parameters and select model through forward selection using adjusted R2 as the stopping criterion.

## Step: R2.adj= 0 
## Call: allelecount ~ 1 
##  
##           R2.adjusted
## + PC8   0.00015395211
## + PC4   0.00011365471
## + PC6   0.00008836192
## + PC1   0.00007349204
## + PC3   0.00004668602
## + PC7   0.00001146825
## <none>  0.00000000000
## + PC2  -0.00004157001
## + PC5  -0.00010066119
## + PC9  -0.00010210653
## 
##       Df    AIC      F Pr(>F)  
## + PC8  1 1932.6 1.0362  0.024 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step: R2.adj= 0.0001539521 
## Call: allelecount ~ PC8 
##  
##          R2.adjusted
## + PC4  0.00025497145
## + PC1  0.00022665873
## + PC7  0.00019480392
## + PC3  0.00019328549
## + PC6  0.00017626019
## <none> 0.00015395211
## + PC2  0.00009336978
## + PC9  0.00005390553
## + PC5  0.00004739829
## 
##       Df    AIC      F Pr(>F)  
## + PC4  1 1933.6 1.0236  0.076 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step: R2.adj= 0.0002549714 
## Call: allelecount ~ PC8 + PC4 
##  
##         R2.adjusted
## + PC1  0.0003180708
## + PC6  0.0002782383
## + PC7  0.0002616803
## <none> 0.0002549714
## + PC3  0.0002520197
## + PC2  0.0001828849
## + PC9  0.0001697569
## + PC5  0.0001369291
## 
##       Df    AIC      F Pr(>F)
## + PC1  1 1934.5 1.0147   0.18

Selected parameters are PC8 and PC4.

Biplot of principal component analysis for selected PCs of environmental parameters for estuaries in the Western Gulf, central Gulf, and eastern Gulf.

Biplot of principal component analysis for selected PCs of environmental parameters for estuaries in the Western Gulf, central Gulf, and eastern Gulf.

Compare loadings on of variables on selected principle components

Loading of all environmental parameters on selected principle components, PC8 and PC4.

Loading of all environmental parameters on selected principle components, PC8 and PC4.

7.2 Variance Partitioning

Variance partitioning was used to compare the contribution of geographic distance and variation in the environmental differences to explain the observed genomic variation. A full model, using selected geographic and environmental variables, a partial model, using geographic data conditioned on environmental variables, and a partial model, using environmental data conditioned on geographic data, were considered to partition the explainable variance into individual (geography or environment) and shared components (geography plus environment), using vegan. Significance of each component was tested using 1,000 permutations.

Partition variance to determine if geographic location or environmental data driving observed pattern(s).

Partition the explainable variance (i.e. total inertia for constrain matrix of full model): determine total, constrained, unconstrained values of inertia/proportion of total inertia (variance).

Full model

Perform RDA for all variables (spatial and environmental; full model).

## Call: rda(formula = allelecount ~ xy.mat + env.mat, scale = FALSE)
## 
##                Inertia Proportion Rank
## Total         3571.558      1.000     
## Constrained     46.438      0.013    3
## Unconstrained 3525.120      0.987  232
## Inertia is variance 
## 
## Eigenvalues for constrained axes:
##   RDA1   RDA2   RDA3 
## 15.857 15.559 15.021 
## 
## Eigenvalues for unconstrained axes:
##    PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8 
## 20.686 20.650 20.539 20.358 20.311 20.142 20.105 20.037 
## (Showing 8 of 232 unconstrained eigenvalues)

The proportion of variance explained by constrained PCA (rda) is 1.3%, adjusted R2 value is 0.0002391

Significance of overall model (costal distance + environment)

Df Variance F Pr(>F)
Model 3 46.43757 1.018738 0.023
Residual 232 3525.12011 NA NA

Proportion of variance explained by each RDA axis.

RDA1 RDA2 RDA3
Eigenvalue 15.856806 15.559418 15.021348
Proportion Explained 0.341465 0.335061 0.323474
Cumulative Proportion 0.341465 0.676526 1.000000

Test for significance of each constrained axis.

Df Variance F Pr(>F)
RDA1 1 15.85681 1.0435897 0.0509491
RDA2 1 15.55942 1.0240176 0.2687313
RDA3 1 15.02135 0.9886054 0.7572428
Residual 232 3525.12011 NA NA

The overall model is significant (P = 0.023).

Partition the variation in genetic data into components accounted for by environmental and spatial variables and their combined effect using adjusted R squared to assess partitions explained by explanatory tables and their combinations.

# Partition the variance ====
vpart.all <- varpart(allelecount, ~ xy.mat, ~ env.mat)


vpart.all
## 
## Partition of variance in RDA 
## 
## Call: varpart(Y = allelecount, X = ~xy.mat, ~env.mat)
## 
## Explanatory tables:
## X1:  ~xy.mat
## X2:  ~env.mat 
## 
## No. of explanatory tables: 2 
## Total variation (SS): 839316 
##             Variance: 3571.6 
## No. of observations: 236 
## 
## Partition table:
##                      Df R.squared Adj.R.squared Testable
## [a+b] = X1            1   0.00440       0.00014     TRUE
## [b+c] = X2            2   0.00876       0.00025     TRUE
## [a+b+c] = X1+X2       3   0.01300       0.00024     TRUE
## Individual fractions                                    
## [a] = X1|X2           1                -0.00002     TRUE
## [b]                   0                 0.00016    FALSE
## [c] = X2|X1           2                 0.00010     TRUE
## [d] = Residuals                         0.99976    FALSE
## ---
## Use function 'rda' to test significance of fractions of interest

Extract fraction explained by each component:

Test significance of all components using permuted p-values.

Variance partitioning of variance explained by coastal distance (xy), environmental variables (env), and shared effects due to correlation of xy and env.

PARTITION VARIANCE PVAL
residuals 0.9997609 NA
env+shared 0.0002550 0.005
xy+env+shared 0.0002391 0.029
shared 0.0001568 NA
xy+shared 0.0001409 0.020
env 0.0000982 0.200
xy -0.0000158 0.576

7.3 Clustering of individuals

Compare distribution of individuals using weighted average individual scores. WA are weighted averages of allele scores that are as similar to linear combinations (LC) scores as possible. Weights are site (individual) totals of species (loci). To determine how well explanatory variables separate groups of individuals or if explanatory variables can be used to discriminate between groups of individuals, use wa-scores (or both).

Biplot of redundancy analysis using PCA-transformed environmental variables and third polynomial of coastal distance as explanatory variables (red arrows). Individuals sampled in the Gulf (colored circles) are plotted according to their component loadings calculated as weighted average scores for individuals sampled in the Gulf.

Biplot of redundancy analysis using PCA-transformed environmental variables and third polynomial of coastal distance as explanatory variables (red arrows). Individuals sampled in the Gulf (colored circles) are plotted according to their component loadings calculated as weighted average scores for individuals sampled in the Gulf.

Compare distribution of individuals for environmental component only.

env <- gen_rda@strata %>%
  select(LIB_ID, ESTUARY) %>%
  mutate(ESTUARY = as.character(ESTUARY)) %>%
  left_join(env_pca_all) %>%
  select(LIB_ID, one_of(selected_env)) %>%
  column_to_rownames("LIB_ID")

rda.env <- rda(allelecount ~ ., data = env, scale = FALSE)

# environmental variables
env_rda <- as.data.frame((rda.env$CCA$biplot)) %>%
  rownames_to_column("ENV") %>%
  mutate(RDA1_flip = -(RDA1),
         RDA2_flip = -(RDA2))

# extract individual scores
ind_rda.env <- rda_indv(rda.env, 2) %>%
  left_join(strata) %>%
  mutate(ESTUARY = ordered(ESTUARY,
                           levels = c("SanAntonioBay", "MatagordaBay", "GalvestonBay", "SabineLake",
                     "BaratariaBay", "WestMississippiSound", "EastMississippiSound", "MobileBay",
                     "ApalachicolaBay")))

tmp <- ind_rda.env %>%
  mutate(RDA1_WA_SCALED3_flip = -(RDA1_WA_SCALED3),
         RDA2_WA_SCALED3_flip = -(RDA2_WA_SCALED3))

# plot
ggplot() +
  geom_vline(xintercept = 0, color = "darkblue", linetype = "dashed") +
  geom_hline(yintercept = 0, color = "darkblue", linetype = "dashed") +
  geom_label_repel(data = env_rda, aes(x = RDA1_flip, y = RDA2_flip, label = ENV)) +
  geom_segment(data = env_rda, aes(x = 0, y = 0, xend = RDA1_flip, yend = RDA2_flip),
               arrow = arrow(length = unit(0.2, "cm")),
               color="darkred", size = 1) +
  geom_point(data = tmp, aes(x = RDA1_WA_SCALED3_flip, RDA2_WA_SCALED3_flip, fill = ESTUARY),
             shape = 21, size = 3) +
  scale_fill_manual(values = col_estuaries) +
  labs(x = "RDA1", y = "RDA2") +
  theme_standard +
  theme(legend.position = "bottom")

7.4 Extract highly associated loci

Use the Mahalanobis distance to identify alleles with strongest associations to both RDA axes.

Allele loadings as (weighted) orthonormal scores for independent/unconstrained variables, alleles with Mahalanobis distance > 25 are indicated in red.

Allele loadings as (weighted) orthonormal scores for independent/unconstrained variables, alleles with Mahalanobis distance > 25 are indicated in red.

A total of 384 loci are flagged as significantly associated with the constraining variables (Mahalanobis distance > 25).

Distribution of Mahalanobis distance across the linkage map. For each locus allele with the highest Mahalanobis distance is plotted. Outlier loci are indicated in red.

Package versions used for this analysis (R 3.6.0).

##                         package loadedversion
## ade4                       ade4        1.7-15
## adegenet               adegenet         2.1.3
## ape                         ape         5.4-1
## assigner               assigner         0.5.6
## broom                     broom         0.5.6
## coin                       coin         1.3-1
## doMC                       doMC         1.3.6
## doParallel           doParallel        1.0.15
## doSNOW                   doSNOW        1.0.18
## dplyr                     dplyr         1.0.0
## forcats                 forcats         0.5.0
## foreach                 foreach         1.5.0
## ggplot2                 ggplot2         3.3.2
## ggrepel                 ggrepel         0.8.2
## ggthemes               ggthemes         4.2.0
## glue                       glue         1.4.1
## hierfstat             hierfstat        0.5-04
## Imap                       Imap          1.32
## iterators             iterators        1.0.12
## knitr                     knitr          1.29
## lattice                 lattice       0.20-41
## mapdata                 mapdata         2.3.0
## mapproj                 mapproj         1.2.7
## maps                       maps         3.3.0
## mmod                       mmod         1.3.3
## pegas                     pegas          0.13
## permute                 permute         0.9-5
## plotrix                 plotrix         3.7-8
## plyr                       plyr         1.8.6
## poppr                     poppr         2.8.6
## purrr                     purrr         0.3.4
## radiator               radiator         1.1.1
## randomForestSRC randomForestSRC         2.9.3
## readr                     readr         1.3.1
## reshape2               reshape2         1.4.4
## rgdal                     rgdal        1.5-10
## snow                       snow         0.4-3
## sp                           sp         1.4-2
## stringr                 stringr         1.4.0
## survival               survival         3.2-3
## tibble                   tibble         3.0.3
## tidyr                     tidyr         1.1.0
## tidyverse             tidyverse         1.3.0
## tint                       tint       0.1.2.2
## tufte                     tufte         0.6.1
## UpSetR                   UpSetR         1.4.0
## vcfR                       vcfR        1.11.0
## vegan                     vegan         2.5-6
## viridis                 viridis         0.5.1
## viridisLite         viridisLite         0.3.0